Abstract

We consider zero-sum stochastic games with finite state and action spaces, perfect infor- 
mation, mean payoff criteria, without any irreducibility assumption on the Markov chains 
associated to strategies (multichain games). The value of such a game can be characterized 
by a system of nonlinear equations, involving the mean payoff vector and an auxiliary vector 
(relative value or bias). We develop here a policy iteration algorithm for zero-sum stochas- 
tic games with mean payoff, following an idea of two of the authors (Cochet-Terrasson and 
Gaubert, C. R. Math. Acad. Sci. Paris, 2006). The algorithm relies on a notion of nonlinear 
spectral projection (Akian and Gaubert, Nonlinear Analysis TMA, 2003), which is analogous 
to the notion of reduction of super-harmonic functions in linear potential theory. To avoid 
cycling, at each degenerate iteration (in which the mean payoff vector is not improved), the 
new relative value is obtained by reducing the earlier one. We show that the sequence of values 
and relative values satisfies a lexicographical monotonicity property, which implies that the 
algorithm does terminate. We illustrate the algorithm by a mean-payoff version of Richman 
games (stochastic tug-of-war or discrete infinity Laplacian type equation), in which degenerate 
iterations are frequent. We report numerical experiments on large scale instances, arising from 
the latter games, as well as from monotone discretizations of a mean-payoff pursuit-evasion 
deterministic differential game.

1 Introduction

The mean-payoff problem for zero-sum two player multichain games. We consider a
zero-sum stochastic game with finite state space $[n] := \{1, \ldots, n\}$, finite action spaces $A$ and $B$ for
the first and second player respectively, and perfect information. In the case of the finite horizon
problem, in which the payoff of the game induced by a pair of strategies of the two players is
defined as the expectation of the sum in finite horizon of the successive rewards (the payments of
the first player to the second player), Shapley showed (see [Sha03]) that the value $v_i^T$ of the game
with horizon $T$ and initial state $i \in [n]$ satisfies the dynamic programming equation $v^{T+1} = f(v^T)$,
with a dynamic programming operator $f : \mathbb{R}^n \to \mathbb{R}^n$ defined as:

$$[f(v)]_i = \min_{a \in A} \max_{b \in B} \Big( \sum_{j \in [n]} P_{ij}^{ab} v_j + r_i^{ab} \Big), \quad \forall i \in [n],\ v \in \mathbb{R}^n .$$

Here, $r_i^{ab}$ and $P_{ij}^{ab}$ represent respectively the reward in state $i \in [n]$ and the transition probability
from state $i$ to state $j \in [n]$, when the actions of the first and second players are respectively equal
to $a \in A$ and $b \in B$.

The above dynamic programming operator $f$ is order-preserving, meaning that $v \le w \implies f(v) \le f(w)$,
where $\le$ denotes the partial ordering of $\mathbb{R}^n$, and additively homogeneous, meaning
that it commutes with the addition of a constant vector. These two conditions imply that $f$
is nonexpansive in the sup-norm (see for instance [CT80]; see also [GG04] for more background
on this class of nonlinear maps). Moreover, $f$ is polyhedral, meaning that there is a covering of
$\mathbb{R}^n$ by finitely many polyhedra such that the restriction of $f$ to any of these polyhedra is affine.
Kohlberg [Koh80] showed that if $f$ is a polyhedral self-map of $\mathbb{R}^n$ that is nonexpansive in some
norm, then there exist two vectors $\eta$ and $v$ in $\mathbb{R}^n$ such that $f(t\eta + v) = (t+1)\eta + v$ for all $t \in \mathbb{R}$
large enough. A map of the form $t \mapsto t\eta + v$ is called a half-line, and $\eta$ is its slope. It is invariant
if it satisfies the latter property. Moreover, this property is equivalent to the following system of
nonlinear equations for the couple $(\eta, v)$:

$$\eta = \hat f(\eta), \qquad \eta + v = f_\eta(v), \qquad (1)$$

where the maps $\hat f$ (the recession function) and $f_\eta$ are constructed from $f$ (see Section 2).

When $f$ has an invariant half-line with slope $\eta$, the growth rate of its orbits (also called the
cycle time) $\chi(f) := \lim_{k \to \infty} f^k(v)/k$ exists and is equal to $\eta$. Here, $f^k$ denotes the $k$-th iterate of
$f$, and $v$ is an arbitrary vector of $\mathbb{R}^n$. This shows in particular that the value of the finite horizon
game satisfies $\lim_{T \to \infty} v_i^T / T = \eta_i$ for any final reward. Moreover, $\eta_i$ gives the value of the game
with initial state $i$ and mean payoff, that is, such that the payoff of the game induced by a pair of
strategies of the two players is the Cesaro limit of the expectation of the successive rewards. Then
a vector $v$ such that $t \mapsto t\eta + v$ is an invariant half-line is called a relative value of the game, or
bias. It is not unique, even up to an additive constant.

In this paper, we give an algorithm to find an invariant half-line, or equivalently a solution 
of (1), for general multichain games. This allows us in particular to determine the mean payoff, as
well as optimal strategies for both players. By multichain, we mean that there is no irreducibility 
assumption on the Markov chains associated to the strategies of the two players, which may have 
in particular several invariant measures. 



Classes of games solvable by earlier policy iteration algorithms. Policy iteration is a
general method initially introduced by Howard [How60] in the case of one player problems (Markov
decision processes). The idea is to compute a sequence of strategies as well as certain valuations, 
which serve as optimality certificates, and to use the current valuation to improve the strategy. The 
algorithm bears some resemblance with the Newton method, as the strategy determines a sub or 
super-gradient of the dynamic programming operator. The key of the analysis of policy iteration 
algorithms is generally to show that the sequence of valuations which are computed satisfies a 
monotonicity property, from which it can be inferred that the same strategy is never selected 
twice. In the discounted one player case, the valuation which is maintained by the algorithm is 
nothing but the value vector of the current policy. For one-player games with mean-payoff, in the 
unichain case (in which every stochastic matrix associated to a strategy has only one final class), 
the valuation consists of the mean payoff of the current strategy, as well as of a relative value. 
In both cases, the monotonicity property is natural (it relies on the discrete maximum principle, 
or properties of monotonicity and contraction of the dynamic programming operator, or on the 
uniqueness of the invariant measure associated to a strategy). However, even for one player games, 
the extension to the multichain case is more difficult. It was initially proposed by Howard [How60].
The convergence of his method was established by Denardo and Fox [DF68].

The idea of extending Howard's algorithm to the two player case appeared independently in the
work of Hoffman and Karp [HK66] for a subclass of mean-payoff games with imperfect information,
and in the work of Denardo [Den67] for discounted games. Both algorithms consist of nested
iterations; the internal iterations are a simplified version of the one player Howard algorithm. The
algorithm for discounted games appeared also, as an adaptation of the Hoffman-Karp algorithm,
in the work of Rao, Chandrasekaran, and Nair [RCN73, Algorithm 1] and of Puri [Pur95] (deterministic
games with perfect information). More recently, Raghavan and Syed [RS03] developed a
related algorithm in which strategy improvements involve only one state at each iteration.

The Hoffman-Karp algorithm requires the game to satisfy a strong irreducibility assumption 
(each stochastic matrix arising from a choice of strategies of the two players must be irreducible). 
Without an assumption of this kind, degenerate iterations, at which the mean payoff vector is not 
improved, may occur, and so the algorithm may cycle (we shall indeed see such an example in 
Section [6]). This pathology appears in particular for the important subclass of deterministic mean 
payoff games, for which the irreducibility assumption is essentially never satisfied.

A natural idea to solve mean-payoff games, appearing for instance in the work of Puri [Pur95],
is to apply the policy iteration algorithm of Denardo [Den67] or Rao, Chandrasekaran, and
Nair [RCN73, Algorithm 1] for discounted games, choosing a given discount factor $\alpha$ sufficiently
close to one, which allows one to determine the so-called Blackwell optimal policies. For deterministic
games, when the rewards are integers with modulus less than or equal to $W$ and the number of
states is equal to $n$, Zwick and Paterson [ZP96] showed that taking $1 - \alpha = 1/(4n^3 W)$ is sufficient
to determine the mean payoff by a rounding argument. However, this requires high precision arithmetic.
In the case of stochastic games, the situation is even worse, since examples are known in
which the value of $1 - \alpha$ to be used for rounding has a denominator exponential in the number
of states. In particular, an approach of this kind is impracticable if one works in floating point
(bounded precision) arithmetic. Hence, it is desirable to have a policy iteration algorithm for
multichain stochastic games relying only on the computation of mean payoffs and relative values,
as in the algorithm of Howard [How60] and Denardo and Fox [DF68].

The first policy iteration algorithm not relying on vanishing discount, for general (multichain)
deterministic mean payoff games, was apparently introduced by Cochet-Terrasson, Gaubert and
Gunawardena [CTGG99, GG98]. The former reference concerns the special case in which the mean
payoff is the same for all states at each iteration, whereas the second one covers the general case,
see also [CT01]. Details of implementation, as well as experimental results, were given in [DG06].
The idea of the algorithm of [CTGG99] is to handle degenerate iterations by a tropical (max-plus)
spectral projector. The latter is a tropically linear retraction of the whole space onto the fixed
point set of the dynamic programming operator associated to a given strategy of the first player.
When the mean payoff or the current strategy is not improved, the new relative value is obtained
by applying a spectral projector to the earlier relative value. The proof of termination of the
algorithm of [CTGG99] relies on a key ingredient from tropical spectral theory, that a fixed point
of a tropically linear map is uniquely defined by its restriction to the critical nodes (the nodes
appearing infinitely often in a strategy which gives the optimal mean payoff). Then, it was shown
in [CTGG99] that at each degenerate iteration, the relative value decreases, and that the set of
critical nodes also decreases, from which the termination of the algorithm can be deduced.

A related class of games consists of parity games, which can be encoded as special deterministic
games with mean payoff. A policy improvement algorithm for parity games was introduced
by Vöge and Jurdziński [VJ00]. This algorithm differs from the one of [CTGG99, GG98] in
that instead of the relative value, the algorithm maintains a set of relevant reachable vertices.
Other policy algorithms for parity games or deterministic mean payoff games were introduced
later on by Björklund, Sandberg and Vorobyov [BSV04, BV07], and by Jurdziński, Paterson,
and Zwick [JPZ06]. An experimental comparison of algorithms for deterministic games was recently
made by Chaloupka [Cha11, Cha09], who also gave an optimized version of the algorithm
of [CTGG99, GG98, DG06].

In [BCPS04], Bielecki, Chancelier, Pliska, and Sulem used a policy iteration algorithm to solve a
semi-Markov mean-payoff game problem with infinite action spaces obtained from the discretization
of a quasi-variational inequality, based on the approach of [CTGG99, GG98]. Their algorithm
proceeds in a Hoffman and Karp fashion.



Policy iteration algorithm for stochastic multichain zero-sum games with mean payoff.
Inspired by the policy iteration algorithm of [CTGG99, GG98] for deterministic games, Cochet-Terrasson
and Gaubert proposed in [CTG06] a policy iteration algorithm for general stochastic
games (see also [CT01] for a preliminary version). The relative values are now constructed using
the nonlinear analogues of tropical spectral projectors. These nonlinear projectors were introduced
by Akian and Gaubert in [AG03]. They can be thought of as nonlinear analogues of the operation
of reduction of a super-harmonic function, arising in potential theory. However, no implementation
details were given in the short note [CTG06], in which the algorithm was stated abstractly, in terms
of invariant half-lines.

We develop here fully the idea of [CTG06], and describe a policy iteration algorithm for multichain
stochastic games with mean payoff (see Section 4.2). We explain how nonlinear systems of
the form (1) are solved at each iteration. We show in particular how nonlinear spectral projections
can be computed, by solving an auxiliary (one player) optimal stopping problem. This relies on
the determination of the so-called critical graph, the nodes of which (critical nodes) are visited
infinitely often (almost surely) by an optimal strategy of a one player mean payoff stochastic game.



An algorithm to compute the critical graph, based on results on j.\( II KS . is given in Section 5.3 



We give the proof of the convergence theorem (which was only stated in [CTG06]). In particular,
we show that the sequence $(\eta^{(k)}, v^{(k)}, C^{(k)})$ consisting of the mean payoff vector, relative value
vector, and set of critical nodes, constructed by the algorithm satisfies a kind of lexicographical
monotonicity property so that it converges in finite time (see Section 4.3). The proof of convergence
exploits some results of spectral theory of convex order-preserving additively homogeneous maps,
by Akian and Gaubert [AG03]. Hence, the situation is somewhat analogous to the deterministic
case [CTGG99], the technical results of tropical (linear) spectral theory used in [CTGG99] being
now replaced by their nonlinear analogues [AG03]. Note also that the convergence proof of the
algorithm of [CTGG99, GG98] can be recovered as a special case of the present proof.

The convergence proof leads to a coarse exponential bound on the execution time of the algo- 
rithm: the number of iterations of the first player is bounded by its number of strategies, and the 
number of elementary iterations (resolutions of linear systems) is bounded by the product of the 
number of strategies of both players. 

We also show that the specialization of this algorithm to a one-player game gives an algorithm 
which is similar to the multichain policy algorithm of Howard and Denardo and Fox, see Section 
Then, we discuss an example (see Section 6) involving a variant of Richman games
(also called stochastic tug-of-war [PSSW09], related with discretizations of the infinity Laplacian
[Obe05]), showing that degenerate iterations do occur and that cycling may occur with naive
policy iteration rules. Hence, the handling of degenerate iterations, that we do here by nonlinear 
spectral projectors, cannot be dispensed with. 

The present algorithm has been implemented in the C library PIGAMES by Detournay,
see [Det12] for more information. We finally report numerical experiments (see Section 7) carried
out using this library, both on random instances of Richman type games with various numbers
of states and on a class of discrete games arising from the monotone discretization of a pursuit-evasion
differential game. These examples indicate that degenerate iterations are frequent, so that
their treatment cannot be dispensed with. They also show that the algorithm scales well, allowing
one to solve structured instances with $10^6$ nodes and $10^7$ actions in a few hours of CPU time on a
single core processor (the bottleneck being the resolution of linear systems).

We note that our experiments are consistent with earlier experimental tests carried out for
simpler algorithms (dealing with one player or deterministic problems) of which the present one
is an extension. These tests indicate that policy iteration algorithms are fast on typical instances
(although instances with an exponential number of iterations have been recently constructed, as
discussed in the next subsection). Indeed, in the case of one-player deterministic games (maximal
circuit mean problem), Dasdan, Irani and Gupta [DIG98] concluded that the instrumentation
of Howard's policy iteration algorithm by Cochet-Terrasson et al. [CTCG+98], in which
each iteration is carried out in linear time, was the fastest algorithm on their test suite. Dasdan
later on developed further optimizations of this method [Das04]. More recent experiments
by Georgiadis, Goldberg, Tarjan, and Werneck [GGTW09] have indicated that the class of cycle
based algorithms (to which [CTCG+98, Das04] belong) is among the best performers, close
second to the tree based method of Young, Tarjan, and Orlin [YTO91]. In the deterministic
two player case, Chaloupka [Cha09] compared several algorithms and observed that the one
of [CTGG99, GG98, DG06], with the optimization that he introduced (see also [Cha11]), is experimentally
the best performer.



Alternative algorithms and complexity issues. Gurvich, Karzanov and Khachiyan [GKK88]
were the first to develop a combinatorial algorithm (pumping algorithm) to solve zero-sum deterministic
games with mean payoff. An alternative approach was developed by Zwick and Paterson
[ZP96], who showed that such a game can be solved by considering the finite horizon
game for a sufficiently large horizon, and applying a rounding argument. Both algorithms are
pseudo-polynomial. Other algorithms, also pseudo-polynomial, based on max-plus (tropical) cyclic
projections, with a value iteration flavor, have been developed by Butkovič and Cuninghame-Green
[CGB03], Gaubert and Sergeev [GS07], and Akian, Gaubert, Nitica and Singer [AGNS11].
Deterministic mean payoff games have been recently proved to be equivalent to decision problems
for tropical polyhedra (the tropical analogue of linear programming) [AGG12]. More generally, the
results there show that stochastic games problems with mean payoff can be cast as tropical convex
(non-polyhedral) programming problems.

The pumping algorithm of [GKK88] was recently extended to the case of stochastic games
with perfect information by Boros, Elbassioni, Gurvich, and Makino [BEGM10]. They showed
that their algorithm is pseudo-polynomial when the number of states of the game at which a
random transition occurs remains fixed. No pseudo-polynomial algorithm seems currently known without
the latter restriction. Their algorithm applies to more general games than the ones covered by the
irreducibility assumption of Hoffman and Karp in [HK66], but it does not apply to all multichain
games.

The question of the complexity of deterministic mean payoff games was raised in [GKK88], and
it has remained open since that time. Note in this respect that such games are known to have
a good characterization in the sense of Edmonds, i.e., to be in NP $\cap$ coNP. Indeed, the strategies
of one player can be used as concise certificates, as observed by Condon [Con92], Paterson and
Zwick [ZP96]. Such games even belong to the class UP $\cap$ coUP, as shown by Jurdziński [Jur98].
We refer the reader to the discussion in [BSV04, JPZ06] for more information. The arguments of
Condon [Con92] also imply that zero-sum stochastic games with perfect information (and finite
state and action spaces) belong to NP $\cap$ coNP. An important subclass of deterministic games with
mean payoff consists of parity games. These can be reduced to mean payoff deterministic games
(Puri [Pur95]), which in turn can be reduced to discounted deterministic games. The latter ones
can be reduced to simple stochastic games (Zwick and Paterson [ZP96]). In [AM09], Andersson
and Miltersen generalized this result, showing that stochastic mean payoff games with perfect
information, stochastic parity games and simple stochastic games are polynomial time equivalent.
In particular, the decision problem corresponding to a game of any of these classes lies in the
complexity class NP $\cap$ coNP.

Friedmann has recently constructed an example [Fri09] showing that the Vöge-Jurdziński strategy
improvement algorithm for parity games [VJ00] may require an exponential number of iterations.
This also yields an exponential lower bound [Fri11] for the Hoffman-Karp strategy improvement
rule for discounted deterministic games [Pur95]. The result of Friedmann has also
been extended to total reward and undiscounted MDP by Fearnley [Fea10a, Fea10b] and to simple
stochastic games and weighted discounted stochastic games by Andersson [And09].

Moreover, for Markov decision processes with a fixed discount factor, some upper bound on the
number of policy iterations was given in [MH86]. Recently, Ye gave the first strongly polynomial
bound [Ye05, Ye11]. The latter bound has been improved and generalized to zero-sum two player
stochastic games with perfect information by Hansen, Miltersen and Zwick in [HMZ11],
again for a fixed discount factor, giving the first strongly polynomial bound for these games. Note
that a polynomial bound for mean payoff games does not follow from these results (to address the
mean payoff case, we need to consider the situation in which the discount factor tends to 1).

Complexity results of a different nature have been established with motivations from numerical 
analysis (discretizations of PDE) , exploiting in particular the relation between policy iteration and 
the Newton method. The policy iteration algorithm for one-player discounted games with an infinite 
number of actions has been proved to have a superlinear convergence around the solution under 
suitable assumptions (see in particular the works of Puterman and Brumelle [PB79], Akian [Aki90],
and Bokanowski, Maroso, and Zidani [BMZ09]). Chancelier, Messaoud, and Sulem [CMS07] also
considered, in view of their application to quasi-variational inequalities, partially undiscounted 
infinite horizon problems for which they proved the contraction of the policy iteration algorithm. 

The plan of the paper is the following: Section 2 recalls some background on stochastic zero-sum
two player games, Section 3 explains the construction of the nonlinear projection, Section 4 gives
the algorithm, its practical version and its proof, Section 5 gives the ingredients of the algorithm,
Section 6 shows an example with possible cycling of iterations when not using the notion of spectral
projector, and Section 7 reports the numerical experiments.

2 Two player zero-sum stochastic games with discrete time and mean payoff

The class of two player zero-sum stochastic games was first introduced by Shapley in the early
fifties, see [Sha03]. We recall in this section basic definitions on these games in the case of finite
state space and discrete time (for more details see [Sha03, FV97, Sor02]).

We consider the finite state space $[n] := \{1, \ldots, n\}$. A stochastic process $(\xi_k)_{k \ge 0}$ on $[n]$ gives
the state of the game at each point of time $k$, called a stage. At each of these stages, two players, called
"MIN" and "MAX" (the minimizer and the maximizer), have the possibility to influence the course
of the game.

The stochastic game $\Gamma(i_0)$ starting from $i_0 \in [n]$ is played in stages as follows. The initial state
$\xi_0$ is equal to $i_0$ and known by the players. Player MIN plays first, and chooses an action $\alpha_0$ in a
set of possible actions $A_{\xi_0}$. Then the second player, MAX, chooses an action $\beta_0$ in a set of possible
actions $B_{\xi_0}$. The actions of both players and the current state determine the payment $r_{\xi_0}^{\alpha_0 \beta_0}$ made
by MIN to MAX and the probability distribution $j \mapsto P_{\xi_0 j}^{\alpha_0 \beta_0}$ of the new state $\xi_1$. Then the game
continues in the same way with state $\xi_1$ and so on.

At a stage $k$, each player chooses an action knowing the history defined by
$\zeta_k = (\xi_0, \alpha_0, \beta_0, \ldots, \xi_{k-1}, \alpha_{k-1}, \beta_{k-1}, \xi_k)$ for MIN and $(\zeta_k, \alpha_k)$ for MAX. We call a strategy or policy for a player a
rule which tells him the action to choose in any situation. There are several classes of strategies.
Assume $A_i \subset A$ and $B_i \subset B$ for some sets $A$ and $B$. A behavior or randomized strategy for MIN
(resp. MAX) is a sequence $\sigma := (\sigma_0, \sigma_1, \ldots)$ (resp. $\delta := (\delta_0, \delta_1, \ldots)$) where $\sigma_k$ (resp. $\delta_k$) is a
map which to a history $h_k = (i_0, a_0, b_0, \ldots, i_{k-1}, a_{k-1}, b_{k-1}, i_k)$ with $i_\ell \in [n]$, $a_\ell \in A_{i_\ell}$, $b_\ell \in B_{i_\ell}$ for
$0 \le \ell \le k$ (resp. $(h_k, a_k)$) at stage $k$ associates a probability distribution on a probability space over
$A$ (resp. $B$) whose support is included in the possible actions space $A_{i_k}$ (resp. $B_{i_k}$). A Markovian
strategy is a strategy which only depends on the information of the current stage $k$: $\sigma_k$ (resp. $\delta_k$)
depends only on $i_k$ (resp. $(i_k, a_k)$); then $\sigma_k(h_k)$ (resp. $\delta_k(h_k, a_k)$) will be denoted $\sigma_k(i_k)$ (resp.
$\delta_k(i_k, a_k)$). It is said to be stationary if it is independent of $k$; then $\sigma_k$ is also denoted by $\sigma$ and $\delta_k$ by
$\delta$. A strategy of any type is said to be pure if for any stage $k$, the values of $\sigma_k$ (resp. $\delta_k$) are Dirac
probability measures at certain actions in $A_{i_k}$ (resp. $B_{i_k}$); then we denote also by $\sigma_k$ (resp. $\delta_k$) the
map which to the history assigns the only possible action in $A_{i_k}$ (resp. $B_{i_k}$).

In particular, if $\sigma$ is a pure Markovian stationary strategy, also called feedback strategy, then
$\sigma = (\sigma_k)_{k \ge 0}$ with $\sigma_k = \sigma$ for all $k$, and $\sigma$ is a map $[n] \to A$ such that $\sigma(i) \in A_i$ for all $i \in [n]$. In
this case, we also speak about pure Markovian stationary or feedback strategy for $\sigma$ and we denote
by $A_M$ the set of such maps. We adopt a similar convention for player MAX:
$B_M := \{\delta : [n] \times A \to B \mid \delta(i, a) \in B_i \ \forall i \in [n],\ a \in A_i\}$.

A strategy $\sigma = (\sigma_k)_{k \ge 0}$ (resp. $\delta = (\delta_k)_{k \ge 0}$) together with an initial state determines stochastic
processes $(\alpha_k)_{k \ge 0}$ for the actions of MIN, $(\beta_k)_{k \ge 0}$ for the actions of MAX and $(\xi_k)_{k \ge 0}$ for the states
of the game such that

$$P(\xi_{k+1} = j \mid \zeta_k = h_k,\ \alpha_k = a,\ \beta_k = b) = P_{ij}^{ab}, \qquad (2a)$$
$$P(\alpha_k \in A' \mid \zeta_k = h_k) = \sigma_k(h_k)(A'), \qquad (2b)$$
$$P(\beta_k \in B' \mid \zeta_k = h_k,\ \alpha_k = a) = \delta_k(h_k, a)(B'), \qquad (2c)$$

where $\zeta_k := (\xi_0, \alpha_0, \beta_0, \ldots, \xi_{k-1}, \alpha_{k-1}, \beta_{k-1}, \xi_k)$ is the history process, $h_k$ is a history vector at time
$k$: $h_k = (i_0, a_0, b_0, \ldots, i_{k-1}, a_{k-1}, b_{k-1}, i)$ and $A'$ (resp. $B'$) are measurable sets in $A$ (resp. $B$). For
instance, for each pair of feedback strategies $(\sigma, \delta)$ of the two players, that is such that for $k \ge 0$:
$\sigma_k = \sigma$ with $\sigma \in A_M$ and $\delta_k = \delta$ with $\delta \in B_M$, the state process $(\xi_k)_{k \ge 0}$ is a Markov chain on $[n]$
with transition probability

$$P(\xi_{k+1} = j \mid \xi_k = i) = P_{ij}^{\sigma(i)\,\delta(i, \sigma(i))} \quad \text{for } i, j \in [n],$$

and $\alpha_k = \sigma(\xi_k)$ and $\beta_k = \delta(\xi_k, \alpha_k)$.

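When the strategies $\sigma$ for MIN and $\delta$ for MAX are fixed, the payoff in finite horizon $T$ of the
game $\Gamma(i, \sigma, \delta)$ starting from $i$ is

$$J^T(i, \sigma, \delta) = E_i^{\sigma, \delta} \Big[ \sum_{k=0}^{T-1} r_{\xi_k}^{\alpha_k \beta_k} \Big],$$

where $E_i^{\sigma, \delta}$ denotes the expectation for the probability law determined by (2). The mean payoff of
the game $\Gamma(i, \sigma, \delta)$ starting from $i$ is

$$J(i, \sigma, \delta) = \limsup_{T \to \infty} \frac{1}{T} J^T(i, \sigma, \delta).$$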

When the action spaces $A_i$ and $B_i$ are finite sets for all $i \in [n]$, the finite horizon game and the
mean payoff game have a value, which is given respectively by:

$$v_i^T = \inf_{\sigma} \sup_{\delta} J^T(i, \sigma, \delta), \qquad (3)$$

and

$$\rho_i = \inf_{\sigma} \sup_{\delta} J(i, \sigma, \delta), \qquad (4)$$

for all starting states $i \in [n]$, where the infimum is taken over all strategies $\sigma$ for MIN and the
supremum is taken over all strategies $\delta$ for MAX (see [Sha03] for finite horizon games, and [LL69]
for mean payoff games).

Indeed, the value $v^T$ of the finite horizon game satisfies the dynamic programming equation [Sha03]:

$$v_i^{T+1} = \min_{a \in A_i} \max_{b \in B_i} \Big( \sum_{j \in [n]} P_{ij}^{ab} v_j^T + r_i^{ab} \Big), \quad \forall i \in [n], \qquad (5)$$

with initial condition $v_i^0 = 0$, $i \in [n]$. Moreover, optimal strategies are obtained for both players by
taking pure Markovian strategies $\sigma$ for MIN and $\delta$ for MAX such that, for all $k = 0, \ldots, T-1$ and $i$ in
$[n]$, $\sigma_k(i)$ attains the minimum in (5) with $T$ replaced by $T - k - 1$, and that, for all $k = 0, \ldots, T-1$,
$i$ in $[n]$ and $a$ in $A_i$, $\delta_k(i, a)$ attains the maximum in the expression of $F(v^{T-k-1}; i, a)$ defined as
follows:

$$F(v; i, a) = \max_{b \in B_i} \Big( \sum_{j \in [n]} P_{ij}^{ab} v_j + r_i^{ab} \Big). \qquad (6)$$

We denote by $f$ the dynamic programming or Shapley operator from $\mathbb{R}^n$ (that is here equivalent
to $\mathbb{R}^{[n]}$) to itself given by:

$$[f(v)]_i := F(v; i) := \min_{a \in A_i} F(v; i, a), \quad \forall i \in [n],\ v \in \mathbb{R}^n. \qquad (7)$$

Then, the dynamic programming equation of the finite horizon game writes:

$$v^{T+1} = f(v^T). \qquad (8)$$
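As an illustration of the recursion $v^{T+1} = f(v^T)$, here is a minimal Python sketch of the Shapley operator. The three-state game, the nested-list layout `game[i][a][b] = (reward, transition row)`, and the function names are assumptions made for this example only; they are not taken from the paper or from the PIGAMES library.

```python
import numpy as np

# Hypothetical 3-state game (illustration only).
# game[i][a][b] = (r_i^{ab}, row P_i^{ab} of transition probabilities).
e = np.eye(3)
game = [
    # state 0: MIN either stays in 0 (reward 1) or moves to 1 (reward 0)
    [[(1.0, e[0])], [(0.0, e[1])]],
    # state 1: MAX either stays in 1 (reward 2) or moves to 0 (reward 5)
    [[(2.0, e[1]), (5.0, e[0])]],
    # state 2: no choice, reward 0, moves to 0 or 1 with probability 1/2 each
    [[(0.0, 0.5 * e[0] + 0.5 * e[1])]],
]

def shapley(game, v):
    """Return f(v), with [f(v)]_i = min_a max_b (sum_j P_ij^{ab} v_j + r_i^{ab})."""
    v = np.asarray(v, dtype=float)
    return np.array([
        min(max(r + P @ v for (r, P) in row_a) for row_a in state)
        for state in game
    ])

def finite_horizon_values(game, T):
    """Iterate v^{t+1} = f(v^t) from v^0 = 0 and return v^T."""
    v = np.zeros(len(game))
    for _ in range(T):
        v = shapley(game, v)
    return v

print(finite_horizon_values(game, 2))
```

The toy game is multichain: states 0 and 1 are each absorbing under the strategies that end up being optimal, so the mean payoff depends on the initial state.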

The operator $f$ is order-preserving, i.e. $v \le w \implies f(v) \le f(w)$, where $\le$ denotes the partial
ordering of $\mathbb{R}^n$ ($v \le w$ if $v_i \le w_i$ for all $i \in [n]$), and additively homogeneous, i.e. it commutes
with the addition of a constant vector, which means that $f(\lambda + v) = \lambda + f(v)$ for all $\lambda \in \mathbb{R}$
and $v \in \mathbb{R}^n$, where $\lambda + v = (\lambda + v_i)_{i \in [n]}$. This implies that $f$ is nonexpansive in the sup-norm
(see for instance [CT80]). Note that it was observed independently by Kolokoltsov [Kol92], by
Gunawardena and Sparrow (see [Gun03]) and by Rubinov and Singer [RS01] that, conversely, if
$f : \mathbb{R}^n \to \mathbb{R}^n$ is order-preserving and additively homogeneous, then $f$ can be put in the form (6)-(7),
with possibly infinite sets $A_i$ and $B_i$.

When the action spaces $A_i$ and $B_i$ are finite sets for all $i \in [n]$, the map $f$ is also polyhedral,
meaning that there is a covering of $\mathbb{R}^n$ by finitely many polyhedra such that the restriction of $f$
to any of these polyhedra is affine. Kohlberg [Koh80] showed that if $f$ is a polyhedral self-map
of $\mathbb{R}^n$ that is nonexpansive in some norm, then there exist two vectors $v$ and $\eta$ in $\mathbb{R}^n$ such that
$f(t\eta + v) = (t+1)\eta + v$ for all $t \in \mathbb{R}$ large enough. A map $\omega : t \in [t_0, \infty) \mapsto t\eta + v \in \mathbb{R}^n$,
with $t_0 \in \mathbb{R}$ and $\eta, v \in \mathbb{R}^n$, is called a half-line with slope $\eta$. A germ of half-line at infinity is an
equivalence class for the equivalence relation on half-lines $\omega \sim \omega'$ if $\omega(t) = \omega'(t)$ for $t \in \mathbb{R}$ large
enough. A germ can be identified with the couple $(\eta, v)$ of vectors of $\mathbb{R}^n$. Hence, in the sequel, we
shall use the expression "half-line" either for a map $\omega : t \in [t_0, \infty) \mapsto t\eta + v \in \mathbb{R}^n$, for its germ, or
for the couple $(\eta, v)$. We shall say that it is invariant by $f$ if it satisfies the latter property, that is
$f(t\eta + v) = (t+1)\eta + v$ for all $t \in \mathbb{R}$ large enough. The interest of an invariant half-line is that its
slope determines the growth rate of the orbits of $f$, $\chi(f) := \lim_{k \to \infty} f^k(w)/k$. Here, $f^k$ denotes the
$k$-th iterate of $f$, and $w$ is an arbitrary vector of $\mathbb{R}^n$. When it exists, the growth rate $\chi(f)$ is called
the cycle time of $f$. Indeed, if $f(t\eta + v) = (t+1)\eta + v$ for $t \ge t_0$, then $f^k(t_0\eta + v) = (t_0 + k)\eta + v$ for
$k \ge 0$, hence $\lim_{k \to \infty} f^k(t_0\eta + v)/k = \eta$, and by the nonexpansiveness of $f$, $\lim_{k \to \infty} f^k(w)/k = \eta$
for all $w \in \mathbb{R}^n$, that is $\chi(f)$ does exist and is equal to $\eta$. For the game problem this shows that the
value of the finite horizon game has a linear growth with respect to time:

$$\lim_{T \to \infty} \frac{1}{T} v_i^T = [\chi(f)]_i = \eta_i ,$$



where $f$ is the Shapley operator defined in (6)-(7). Moreover, the value $\rho$ of the mean payoff game
defined in (4) coincides with the slope of an invariant half-line of $f$, and thus with the former limit:

$$\rho_i = \eta_i = [\chi(f)]_i .$$
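The limit $\chi(f) = \lim_k f^k(w)/k$ can be observed numerically. The following Python sketch iterates a small operator from two different starting vectors and compares the estimates. The operator `toy_operator` is a hypothetical three-state example written out by hand, and `cycle_time_estimate` is an illustrative helper; neither comes from the paper. This plain iteration only approximates the slope with an error of order $1/k$, which is one reason an exact method such as policy iteration is of interest.

```python
import numpy as np

def toy_operator(v):
    """Shapley operator of a hypothetical 3-state game (illustration only)."""
    return np.array([
        min(1.0 + v[0], 0.0 + v[1]),   # state 0: MIN stays (reward 1) or moves to 1 (reward 0)
        max(2.0 + v[1], 5.0 + v[0]),   # state 1: MAX stays (reward 2) or moves to 0 (reward 5)
        0.5 * v[0] + 0.5 * v[1],       # state 2: reward 0, random move to 0 or 1
    ])

def cycle_time_estimate(f, w, k):
    """Return f^k(w) / k, an O(1/k) approximation of the cycle time chi(f)."""
    v = np.asarray(w, dtype=float)
    for _ in range(k):
        v = f(v)
    return v / k

# By nonexpansiveness, the two estimates differ by at most ||w - w'||_inf / k.
est_from_zero = cycle_time_estimate(toy_operator, [0.0, 0.0, 0.0], 2000)
est_from_other = cycle_time_estimate(toy_operator, [10.0, -3.0, 7.0], 2000)
print(est_from_zero, est_from_other)
```

For this toy operator both estimates approach the slope $(1, 2, 1.5)$, which differs from state to state, as expected for a multichain game.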



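Finally, when the action spaces are finite, one can easily see that the Shapley operator $f$ in (6)-(7)
satisfies, for all $\eta, v \in \mathbb{R}^n$,

$$f(t\eta + v) = t \hat f(\eta) + f_\eta(v) \quad \text{for } t \text{ large}, \qquad (9)$$

where $\hat f$ is the recession function of $f$ (see [GG04]):

$$[\hat f(\eta)]_i := \lim_{t \to \infty} \frac{[f(t\eta)]_i}{t} = \min_{a \in A_i} \max_{b \in B_i} \Big( \sum_{j \in [n]} P_{ij}^{ab} \eta_j \Big), \quad i \in [n], \qquad (10)$$

and $f_\eta$ is what we shall call the tangent of $f$ at infinity around the slope $\eta$:

$$[f_\eta(v)]_i := \lim_{t \to \infty} [f(t\eta + v) - t \hat f(\eta)]_i = \min_{a \in A_{i,\eta}} \max_{b \in B^a_{i,\eta}} \Big( \sum_{j \in [n]} P_{ij}^{ab} v_j + r_i^{ab} \Big), \qquad (11a)$$

with

$$A_{i,\eta} := \operatorname{argmin}_{a \in A_i} \Big( \max_{b \in B_i} \sum_{j \in [n]} P_{ij}^{ab} \eta_j \Big), \qquad (11b)$$

$$B^a_{i,\eta} := \operatorname{argmax}_{b \in B_i} \Big( \sum_{j \in [n]} P_{ij}^{ab} \eta_j \Big). \qquad (11c)$$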



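Indeed, for an action $a \in A_i$ and $i \in [n]$, we have from the finiteness of the sets $B_i$

$$\begin{aligned}
F(t\eta + v; i, a) &= \max_{b\in B_i} \Big( \sum_{j\in[n]} P^{ab}_{ij}(t\eta_j + v_j) + r^{ab}_i \Big) \\
&= \max_{b\in B_i} \Big( t \sum_{j\in[n]} P^{ab}_{ij}\,\eta_j + \sum_{j\in[n]} P^{ab}_{ij}\,v_j + r^{ab}_i \Big) \\
&= t\hat F(\eta; i, a) + \hat F_\eta(v; i, a) \quad \text{for } t \text{ large},
\end{aligned}$$

where one denotes

$$\hat F(\eta; i, a) := \max_{b\in B_i} \sum_{j\in[n]} P^{ab}_{ij}\,\eta_j , \tag{12}$$

$$\hat F_\eta(v; i, a) := \max_{b\in B^a_{i,\eta}} \Big( \sum_{j\in[n]} P^{ab}_{ij}\,v_j + r^{ab}_i \Big) . \tag{13}$$

Then, using the finiteness of the sets $A_i$, and

$$[f(t\eta + v)]_i = F(t\eta + v; i) = \min_{a\in A_i} F(t\eta + v; i, a) ,$$

one obtains Equation (9).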

From (9), we deduce easily that $(\eta, v)$ is an invariant half-line of $f$ if, and only if, it satisfies:

$$\begin{cases} \eta = \hat f(\eta) , \\ \eta + v = \hat f_\eta(v) . \end{cases} \tag{14}$$

This coupled system of equations is what is solved in practice, when one looks for the value function
$\rho = \eta$ of the mean payoff game.
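The coupled system (14) can be checked directly on a small instance. The sketch below is an illustration under assumptions: the three-state game `GAME` is hypothetical (it is not an example from the paper), the function names are ours, and exact integer data are used so that the equality tests defining the argmin and argmax sets in (11b) and (11c) are meaningful.

```python
# Hypothetical toy game (not from the paper): game[i][a][b] = (P_row, r).
GAME = [
    [[([1, 0, 0], 1)]],                   # state 0: absorbing, reward 1
    [[([1, 0, 0], 10), ([0, 0, 1], 0)]],  # state 1: MAX goes to 0 (reward 10) or to 2 (reward 0)
    [[([0, 0, 1], 5)]],                   # state 2: absorbing, reward 5
]

def dot(p, x):
    return sum(pj * xj for pj, xj in zip(p, x))

def recession(game, eta):
    """Recession function (10): min over a, max over b of sum_j P_ij^{ab} eta_j."""
    return [min(max(dot(p, eta) for p, _ in acts_b) for acts_b in acts_a)
            for acts_a in game]

def tangent(game, eta, v):
    """Tangent at infinity (11a), restricted to the actions kept by (11b)-(11c)."""
    out = []
    for acts_a in game:
        f_hat = [max(dot(p, eta) for p, _ in acts_b) for acts_b in acts_a]
        best = min(f_hat)
        out.append(min(
            max(dot(p, v) + r for p, r in acts_b if dot(p, eta) == fa)
            for acts_b, fa in zip(acts_a, f_hat) if fa == best
        ))
    return out

def is_invariant_half_line(game, eta, v):
    """Exact test of the coupled system (14)."""
    shifted = [e + x for e, x in zip(eta, v)]
    return recession(game, eta) == eta and tangent(game, eta, v) == shifted

eta, v = [1, 5, 5], [0, -5, 0]
print(is_invariant_half_line(GAME, eta, v))          # the couple satisfies (14)
print(is_invariant_half_line(GAME, eta, [0, 0, 0]))  # same slope, wrong relative value
```

In this example the identity $f(t\eta + v) = (t+1)\eta + v$ fails at $t = 0$ in state 1, where the reward 10 still dominates, and holds once $t \geq 5/2$. This illustrates why the tests are performed on the couple $(\eta, v)$ rather than at a fixed value of $t$.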

3 Reduced super-harmonic vectors 

We next present the non-linear analogue of a result of classical potential theory, on which the
policy iteration algorithm for mean payoff games relies. Recall that a self-map $f$ of $\mathbb{R}^n$ is
order-preserving if $v \leq w \implies f(v) \leq f(w)$, where $\leq$ denotes the partial ordering of $\mathbb{R}^n$, and that it
is additively homogeneous if it commutes with the addition of a constant vector. More generally,
it is additively subhomogeneous, if $f(\lambda + v) \leq \lambda + f(v)$ for all $\lambda \geq 0$ and $v \in \mathbb{R}^n$. It is easy to
see that an order-preserving self-map $f$ of $\mathbb{R}^n$ is additively subhomogeneous if, and only if, it is
nonexpansive in the sup-norm. (See for instance [GG04] for more background on order-preserving
additively homogeneous maps.)

We shall now recall some definitions and results of [AG03], where the corresponding proofs
can be found, up to an extension from additively homogeneous maps to subhomogeneous maps as
in [AG03, §1.4]. To show the analogy with potential theory, we shall say that a vector $u \in \mathbb{R}^n$
is harmonic with respect to an order preserving, additively (sub)homogeneous map $g$ of $\mathbb{R}^n$ if it
is a fixed point of $g$, i.e. if $g(u) = u$, and that it is super-harmonic if $g(u) \leq u$. ([AG03] deals
more generally with additive eigenvectors and super-eigenvectors.) We shall denote by $\mathcal{H}(g)$ and
$\mathcal{H}^+(g)$ the set of harmonic and super-harmonic vectors respectively.

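We say that a self-map $g$ of $\mathbb{R}^n$ is convex if all its coordinates $g_i : \mathbb{R}^n \to \mathbb{R}$ are convex functions.
Then, the subdifferential of $g$ at a point $u \in \mathbb{R}^n$ is defined as

$$\partial g(u) := \{ M \in \mathbb{R}^{n\times n} \mid g(v) - g(u) \geq M(v - u), \ \forall v \in \mathbb{R}^n \} .$$

Hence,

$$\partial g(u) = \{ M \in \mathbb{R}^{n\times n} \mid M_{i\cdot} \in \partial g_i(u), \ i \in [n] \} , \tag{15}$$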

where $M_{i\cdot}$ denotes the $i$-th row of the matrix $M$. It can be checked that when $g$ is order-preserving
and additively homogeneous (resp. subhomogeneous), $\partial g(u)$ consists of stochastic (resp. substochastic)
matrices, that is matrices with nonnegative entries and row sums equal to 1 (resp. less than or equal
to 1), see [AG03, Cor. 2.2 and (4)]. Assume $g$ has a harmonic vector $u$. We say that a node is
critical if it belongs to a recurrence class of some matrix $M \in \partial g(u)$, where a recurrence class of
$M$ means a (final) communication class $F$ of $M$ such that the $F \times F$ submatrix of $M$ is stochastic
(note that a recurrence class may not exist if $g$ is not additively homogeneous), see [AG03, §2.3
and 1.4]. One defines also the critical graph $\mathcal{G}^c(g)$ of $g$ as the union of the graphs of the $F \times F$
submatrices of the matrices $M \in \partial g(u)$, such that $F$ is a recurrence class of $M$. The set of critical
nodes and the critical graph of $g$ are independent of the choice of the harmonic vector $u$ [AG03,
Prop. 2.5]. Indeed, when $g$ arises from a stochastic control problem with ergodic reward, a node is
critical iff it is recurrent for some stationary optimal strategy.
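The notion of recurrence class of a single (sub)stochastic matrix can be made concrete as follows. This is a minimal sketch, not the algorithm of [AG03, §6.3]: it handles one matrix $M$ given as a list of rows, whereas the set of critical nodes is the union of the recurrence classes over all $M \in \partial g(u)$. The function names and the tolerance `tol` are assumptions made for the illustration.

```python
def reachable(M, i):
    """Set of nodes reachable from i in the graph with an arc k -> j when M[k][j] > 0."""
    seen, stack = {i}, [i]
    while stack:
        k = stack.pop()
        for j, m in enumerate(M[k]):
            if m > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def recurrence_classes(M, tol=1e-12):
    """Final communication classes F of M whose F x F submatrix is stochastic."""
    n = len(M)
    reach = [reachable(M, i) for i in range(n)]
    classes = []
    for i in range(n):
        # Communication class of i: nodes j with i -> j and j -> i.
        cls = frozenset(j for j in reach[i] if i in reach[j])
        is_final = reach[i] == cls  # no arc leaves the class
        is_stochastic = all(abs(sum(M[k][j] for j in cls) - 1) <= tol for k in cls)
        if is_final and is_stochastic and cls not in classes:
            classes.append(cls)
    return classes

# Stochastic matrix with two absorbing states and one transient state.
M1 = [[1, 0, 0], [0.5, 0, 0.5], [0, 0, 1]]
# Substochastic matrix: one final class, but its second row sums to 0.9.
M2 = [[0.5, 0.5], [0.5, 0.4]]
print(recurrence_classes(M1), recurrence_classes(M2))
```

The second matrix shows the remark made above: for a substochastic matrix a final class need not be a recurrence class, so the set of critical nodes can be empty.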

If $I$ is any subset of $[n]$, we denote by $r_I$ the restriction from $\mathbb{R}^n$ to $\mathbb{R}^I$, such that $(r_I v)_i := v_i$,
for all $i \in I$. For all $u \in \mathbb{R}^n$, we define $u_I := r_I u$, and for all self-maps $g$ of $\mathbb{R}^n$, we define $g_I := r_I \circ g$.
Let $J := [n] \setminus I$. We denote by $\imath_I$ the canonical map identifying $\mathbb{R}^I \times \mathbb{R}^J$ to $\mathbb{R}^n$, which sends $(w, z)$
to the vector $u$ such that $u_i = w_i$ for all $i \in I$ and $u_i = z_i$ for all $i \in J$. Then, the transpose $r_I^*$
of $r_I$ is the map from $\mathbb{R}^I$ to $\mathbb{R}^n$ such that $r_I^*(w) = \imath_I(w, 0)$. Finally, for all $I, J \subset [n]$, and for all
$n \times n$ matrices $M$, we denote by $M_{IJ}$ the $I \times J$ submatrix of $M$.

Lemma 1. Let $g$ denote a convex, order preserving, and additively homogeneous self-map of $\mathbb{R}^n$.
Assume that $u \in \mathbb{R}^n$ is harmonic with respect to $g$. Denote by $C$ the set of critical nodes of $g$ and by
$N = [n] \setminus C$ its complement in $[n]$. Then, the map $h : \mathbb{R}^N \to \mathbb{R}^N$ with $h(w) := (r_N \circ g \circ \imath_N)(w, u_C)$
has a unique fixed point.

Proof. Since the map $g$ is order preserving and additively homogeneous, it is nonexpansive in the
sup-norm, and so, the map $h$ is also order preserving and nonexpansive in the sup-norm, hence it
is additively subhomogeneous. Since $u$ is harmonic with respect to $g$, that is a fixed point of $g$,
$u_N$ is a fixed point of the map $h$. A classical result of convex analysis (Theorem 23.9 of [Roc70])
shows in particular that if $G$ is a finite valued convex function defined on $\mathbb{R}^d$, if $A$ is a linear map
$\mathbb{R}^p \to \mathbb{R}^d$, and if $H(v) := G(Av)$, then $\partial H(v) = A^* \partial G(Av)$. Applying this result to every convex
map $G_i$ defined on $\mathbb{R}^n$ such that $G_i(w) := g_i(w + \imath_N(0, u_C))$, with $i \in N$, and to the linear map
$A = r_N^*$, we deduce that $\partial h_i(u_N)$ is the projection on $\mathbb{R}^N$ of the subdifferential of $G_i$ at the point
$r_N^*(u_N)$, or equivalently of the subdifferential of $g_i$ at the point $r_N^*(u_N) + \imath_N(0, u_C) = \imath_N(u_N, u_C) = u$.
Using (15), this implies that $\partial h(u_N) = \{ M_{NN} \mid M \in \partial g(u) \}$. Since $g$ is order preserving
and additively homogeneous, the elements of $\partial g(u)$ are stochastic matrices, and by the above
equality, or since $h$ is order preserving and additively subhomogeneous, the elements of $\partial h(u_N)$ are
substochastic matrices. Recall that the set of critical nodes of $h$ is defined as the set of nodes that
belong to a final class $F$ of some matrix $P \in \partial h(u_N)$ satisfying that $P_{FF}$ is stochastic. Denote by
$F$ such a class. We have $F \subset N$. Moreover, since $\partial h(u_N) = \{ M_{NN} \mid M \in \partial g(u) \}$, we can find
a matrix $Q \in \partial g(u)$ the $N \times N$ submatrix of which, $Q_{NN}$, coincides with $P$. Since $F \subset N$, $Q_{FF}$
coincides with $P_{FF}$. Hence $Q_{FF}$ is a stochastic matrix, which implies that $F$ is a recurrent class
of $Q$. This shows that the nodes of $F$ are critical nodes of $g$, which contradicts the fact that the
set of critical nodes is $C$ since $F \subset N = [n] \setminus C$. Therefore the set of critical nodes of $h$ is empty.
It follows from Corollary 1.3 of [AG03] that $h$ has a unique fixed point. □

We shall need the following result of [AG03].

Lemma 2 ([AG03, (7) and Lemma 3.3]). Let $g$ be a convex order-preserving additively homogeneous
self-map of $\mathbb{R}^n$, with at least one harmonic vector. Denote by $C$ the set of critical nodes. If $u$ is
super-harmonic with respect to $g$, then $g(u) = u$ on $C$, and $g^\omega(u) := \lim_{k\to\infty} g^k(u)$ exists, is
harmonic with respect to $g$ and coincides with $u$ on $C$. Moreover, the map $g^\omega : \mathcal{H}^+(g) \to \mathcal{H}(g)$
is order-preserving, additively homogeneous, convex, and is a projector.

The following result gives other characterizations of $g^\omega(u)$ that allow one to compute it efficiently.

Theorem 3. Let $g$ denote a convex, order preserving, and additively homogeneous self-map of $\mathbb{R}^n$.
Assume that $g$ admits at least one harmonic vector. Let $C$ denote the set of critical nodes of $g$,
and let $N$ denote its complement in $[n]$, $N = [n] \setminus C$. For a super-harmonic vector $u$, the following
conditions define uniquely the same vector $v$:

(i) $v = g^\omega(u) := \lim_{k\to\infty} g^k(u)$;

(ii) $v$ is harmonic and coincides with $u$ on $C$;

(iii) $v$ coincides with $u$ on $C$ and its restriction to $N$ is a fixed point of the map $h : w \mapsto (r_N \circ g \circ \imath_N)(w, u_C)$;

(iv) $v$ is the smallest super-harmonic vector that dominates $u$ on $C$.

Proof. (i)⇒(ii): This follows from Lemma 2.

(ii)⇒(iii): Assume that the vector $v$ is harmonic and coincides with $u$ on $C$ and let $h$ be defined
as in Point (iii). Then, $v_N = h(v_N)$, showing that $v_N$ is a fixed point of $h$.

(iii)⇒(i): Let $v$ and $h$ be as in Point (iii), hence $v_C = u_C$ and $v_N$ is a fixed point of $h$. By Lemma 2,
$w := g^\omega(u)$ is harmonic with respect to $g$ and $w_C = u_C$. Applying Lemma 1 to $g$ and $w$ (instead
of $u$), and using $w_C = u_C$, we get that the fixed point of $h$ is unique, and thus equal to $w_N$. This
shows that $v_N = w_N$, and since $v_C = u_C = w_C$, we get that $v = w = g^\omega(u)$, that is Point (i).

(ii)⇒(iv): Let $v$ be as in Point (ii). Since $v$ is harmonic and coincides with $u$ on $C$, it is
super-harmonic and dominates $u$ on $C$. By ((ii)⇒(iii)), $v_N$ is a fixed point of $h$, with $h$ as in Point
(iii). Assume now that $w$ is super-harmonic and dominates $u$ on $C$, that is $w_C \geq u_C$. Then,
$w \geq g(w)$, and since $g$ is order preserving, $w_N \geq g_N(w_N, w_C) \geq g_N(w_N, u_C) = h(w_N)$. Since $h$
is order-preserving, we deduce from $w_N \geq h(w_N)$ that $w_N \geq h^1(w_N) \geq h^2(w_N) \geq \cdots$. Since $h$ is
nonexpansive and admits a fixed point, every orbit of $h$ is bounded. Hence, $h^k(w_N)$ has a limit as
$k$ tends to infinity, and this limit is a fixed point of $h$. Applying Lemma 1 to $g$ and $v$ (instead of
$u$), and using $v_C = u_C$, we get that the fixed point of $h$ is unique and equal to $v_N$. It follows that
$w_N \geq v_N$. Since $v$ coincides with $u$ on $C$ and $w_C \geq u_C$, we deduce that $w \geq v$. This shows that $v$
is the smallest super-harmonic vector that dominates $u$ on $C$.

(iv)⇒(ii): Let $v$ be a minimal super-harmonic vector that dominates $u$ on $C$ (or the smallest one
if it exists). Since $v$ is a super-harmonic vector, that is $g(v) \leq v$, and $g$ is order-preserving, we get
that $g(g(v)) \leq g(v)$, which shows that $g(v)$ is also super-harmonic. Moreover, by Lemma 2, $g(v)$
coincides with $v$ on $C$, hence it dominates $u$ on $C$. Since $g(v) \leq v$, the minimality of $v$ implies
$g(v) = v$, which shows that $v$ is harmonic. Since $u$ and $v$ are super-harmonic vectors and $g$ is
order-preserving, we get that the infimum $v \wedge u$ of $v$ and $u$ is also a super-harmonic vector. Since
$v$ dominates $u$ on $C$, we get that $v \wedge u$ equals $u$ on $C$. Hence by the minimality of $v$, and $v \wedge u \leq v$,
we obtain that $v = v \wedge u$, hence $v \leq u$. This implies that $v$ equals $u$ on $C$, hence $v$ satisfies (ii). □

Let $g^\omega$ be defined as in Theorem 3. When $g(v) = Mv$ is a linear operator, and $M$ is a
stochastic matrix, $g^\omega(u)$ coincides with the reduced super-harmonic vector of $u$ with respect to the
set $C$. When $g$ is a max-plus linear operator, the operator $g^\omega$ coincides with the spectral projector
which has been defined in the max-plus literature, see [CTGG99]. For this reason, we call $g^\omega$ the
(nonlinear) spectral projector of $g$.
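In the linear case the spectral projector can be approximated by simply iterating the map, following characterization (i) of Theorem 3. The sketch below is only an illustration: the gambler's-ruin matrix `M`, the super-harmonic vector `u` and the stopping tolerance are assumptions chosen for the example, and in practice characterization (iii) gives a linear system on the non-critical nodes that is cheaper to solve.

```python
def apply_linear(M, v):
    """g(v) = M v for a matrix M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def spectral_projection(g, u, tol=1e-12, max_iter=10_000):
    """Approximate g^omega(u) = lim_k g^k(u) for a super-harmonic vector u."""
    x = list(u)
    for _ in range(max_iter):
        y = g(x)
        if max(abs(a - b) for a, b in zip(x, y)) <= tol:
            return y
        x = y
    raise RuntimeError("no convergence within max_iter iterations")

# Symmetric random walk on {0, 1, 2, 3} absorbed at 0 and 3: critical nodes C = {0, 3}.
M = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
u = [0.0, 3.0, 3.0, 3.0]  # super-harmonic: M u = [0, 1.5, 3, 3] <= u

v = spectral_projection(lambda x: apply_linear(M, x), u)
print(v)  # harmonic, equal to u on C, close to [0, 1, 2, 3]
```

The limit agrees with $u$ on the critical nodes 0 and 3 and is harmonic elsewhere, which is the reduced super-harmonic vector of $u$ with respect to $C$ in this example.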

We now define a spectral projector acting on half-lines. We assume that $g$ is a polyhedral,
convex, order preserving, and additively homogeneous self-map of $\mathbb{R}^n$. This implies in particular
that for all $i \in [n]$, the domain of the Legendre-Fenchel transform $g_i^*$ of the coordinate $g_i$ of $g$ is
included in the set of stochastic vectors, and that $g_i$ is the Legendre-Fenchel transform of $g_i^*$, hence
can be put in the same form as in (6):

$$g_i(v) = \max_{b\in B_i} \Big( \sum_{j\in[n]} P^b_{ij}\,v_j + r^b_i \Big) , \tag{16}$$

where, for all $i \in [n]$, $P^b_{i\cdot} \in \mathbb{R}^n$ is a stochastic vector, $r^b_i \in \mathbb{R}$, and $B_i$ is the domain of $g_i^*$, see [AG03,
Prop. 2.1 and Cor. 2.2]. Since the map $g_i$ is polyhedral, the domain of $g_i^*$ is also a polyhedral
convex set, see [Roc70, Th. 19.2], and since it is included in the set of stochastic vectors, it is
compact, hence it is the convex envelope of the finite set of its extremals. Then, in (16), $B_i$ can
be replaced by this finite set.

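Since $g$ is polyhedral, order preserving, and additively homogeneous, we get by Kohlberg theorem
[Koh80] recalled in Section 2 that $g$ has an invariant half-line $(\eta, v)$; $\eta$ is necessarily equal to
$\chi(g)$, and by (14), $v$ and $\eta$ satisfy $\eta = \hat g(\eta)$ and $\eta + v = \hat g_\eta(v)$, where $\hat g$ and $\hat g_\eta$ are defined in (10)
and (11a) respectively. When $g$ is given by (16), these maps can be rewritten as

$$[\hat g(\eta)]_i = \max_{b\in B_i} \sum_{j\in[n]} P^b_{ij}\,\eta_j , \quad i \in [n] , \tag{17}$$

and

$$[\hat g_\eta(v)]_i = \max_{b\in B_{i,\eta}} \Big( \sum_{j\in[n]} P^b_{ij}\,v_j + r^b_i \Big) , \tag{18a}$$

$$B_{i,\eta} := \operatorname{argmax}_{b\in B_i} \Big\{ \sum_{j\in[n]} P^b_{ij}\,\eta_j \Big\} . \tag{18b}$$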

Let us fix an invariant half-line $(\eta, v)$ of $g$. Denote $\tilde g(w) := \hat g_\eta(w) - \eta$, then $v$ is harmonic with
respect to $\tilde g$: $\tilde g(v) = v$. We define the set of critical nodes of $g$, $C(g)$, to be the set of critical
nodes of $\tilde g$. A half-line $w : t \mapsto t\eta + v$ is super-invariant if $g \circ w(t) \leq w(t+1)$, for $t$ large enough.
From (9), this property is equivalent to the conditions $\eta \geq \hat g(\eta)$ with $v_i \geq \tilde g_i(v)$ when $\eta_i = \hat g_i(\eta)$.
In particular when the equality $\eta = \hat g(\eta)$ holds, it is equivalent to $v \geq \tilde g(v)$.

Corollary 1. Assume that $g$ is a polyhedral, convex, order preserving, and additively homogeneous
self-map of $\mathbb{R}^n$. Assume that $w : t \mapsto t\eta + v$ is a super-invariant half-line of $g$ with $\eta = \chi(g)$. Then,
there exists a unique invariant half-line of $g$ which coincides with $w$ on the set of critical nodes of
$g$. It is given by $t \mapsto t\eta + \tilde g^\omega(v)$, where $\tilde g : w \mapsto \hat g_\eta(w) - \eta$.

Proof. As said above, an invariant half-line of $g$ must be of the form $t \mapsto t\eta + z$, where $\eta = \chi(g)$
and $z \in \mathbb{R}^n$ is a fixed point of $\tilde g$. If $w : t \mapsto t\eta + v$ is a super-invariant half-line of $g$ with $\eta = \chi(g)$,
then $\eta = \hat g(\eta)$, and by the equivalence above we get $v \geq \tilde g(v)$. From this, we deduce that $t \mapsto t\eta + z$ is an invariant
half-line of $g$ which coincides with $w$ on $C$, if and only if $z$ is harmonic with respect to $\tilde g$ and
coincides with $v$ on $C$. By Theorem 3, $\tilde g^\omega(v)$ is such a harmonic vector, and it is the unique one.
The corollary follows. □

For any super-invariant half-line $w : t \mapsto t\eta + v$ of $g$ with $\eta = \chi(g)$, we define $g^\omega(w)$ to be the half-line
$t \mapsto t\eta + \tilde g^\omega(v)$, with $\tilde g$ as in Corollary 1.



4 Policy iteration algorithm for stochastic mean payoff games 

The following policy iteration scheme was introduced by Cochet-Terrasson and Gaubert in [CTG06].
We first give, in Algorithm 1, an abstract formulation of the algorithm similar to the one given
in [CTG06], which is convenient to establish its convergence. A detailed practical algorithm will
follow and the proof of the convergence of the algorithm will be given in the last subsection.

4.1 The theoretical algorithm 

In order to present the algorithm, we assume that every coordinate of $f : \mathbb{R}^n \to \mathbb{R}^n$ is given by:

$$f_i(v) = \min_{a\in A_i} f_i^a(v) , \tag{19}$$

where $A_i$ is a finite set, and $f_i^a$ is a polyhedral order preserving, additively homogeneous, and
convex map from $\mathbb{R}^n$ to $\mathbb{R}$. These conditions all together are indeed equivalent to the property
that $f$ is of the form (6,7), since as already observed any polyhedral order preserving, additively
homogeneous and convex map $g$ from $\mathbb{R}^n$ to $\mathbb{R}$ can be put in the form (16), with $B_i$ a finite set.
For all feedback strategies $\sigma \in \mathcal{A}_M := \{ \sigma : [n] \to A, \ i \mapsto \sigma(i) \in A_i \}$, we denote by $f^{(\sigma)}$ the self-map
of $\mathbb{R}^n$ the $i$-th coordinate of which is given by $f_i^{(\sigma)} = f_i^{\sigma(i)}$.

Algorithm 1 (Policy iteration for multichain mean payoff two player games [CTG06]).

Input: A map $f$ the coordinates of which are of the form (19).

Output: An invariant half-line $w : t \mapsto t\eta + v$ of $f$ and an optimal policy $\sigma \in \mathcal{A}_M$.

1. Initialization: Set $k = 0$. Select an arbitrary strategy $\sigma_0 \in \mathcal{A}_M$. Compute an invariant
half-line of $f^{(\sigma_0)}$, $w^{(0)} : t \mapsto t\eta^{(0)} + v^{(0)}$.

2. If $f \circ w^{(k)}(t) = w^{(k)}(t+1)$ holds for $t$ large enough, the algorithm stops and returns $w^{(k)}$ and
$\sigma_k$.

3. Otherwise, improve the strategy $\sigma_k$ for $w^{(k)}$, by selecting a strategy $\sigma_{k+1}$ such that $f \circ w^{(k)}(t) =
f^{(\sigma_{k+1})} \circ w^{(k)}(t)$, for $t$ large enough. The choice of $\sigma_{k+1}$ must be conservative, meaning that,
for all $i \in [n]$, $\sigma_{k+1}(i) = \sigma_k(i)$ if $f_i \circ w^{(k)}(t) = f_i^{(\sigma_k)} \circ w^{(k)}(t)$, for $t$ large enough.

4. Compute an arbitrary invariant half-line $w' : t \mapsto t\eta^{(k+1)} + v'$ of $f^{(\sigma_{k+1})}$. If $\eta^{(k+1)} \neq \eta^{(k)}$,
then set $v^{(k+1)} = v'$, i.e. $w^{(k+1)} = w'$, and go to Step 6. Otherwise ($\eta^{(k+1)} = \eta^{(k)}$), we say
that the iteration is degenerate.

5. Compute the invariant half-line $w^{(k+1)} = (f^{(\sigma_{k+1})})^\omega(w^{(k)})$ of $f^{(\sigma_{k+1})}$, and define $\eta^{(k+1)}$ and
$v^{(k+1)}$ by $w^{(k+1)}(t) = t\eta^{(k+1)} + v^{(k+1)}$.

6. Increment $k$ by one and go to Step 2.

Let us give some details about the well posedness of this algorithm. First, the existence of
the invariant half-lines in Steps 1 and 4 follows from Kohlberg theorem [Koh80] applied to the
polyhedral order preserving additively homogeneous maps $f^{(\sigma_k)}$ with $k \geq 0$. Second, due to the
finiteness of the action sets $A_i$ and the fact that the maps $f_i^a$ are polyhedral, the maps $f$ and $f_i^a$ can
be rewritten in the form recalled in Section 2. Hence, the test of Step 2 and the asymptotic optimization problem of
Step 3 can be rewritten as an equality test for (germs of) half-lines and the pointwise minimization
of a finite set of half-lines, which are transformed into systems of equations and lexicographical
optimization problems, using the representation of half-lines as couples $(\eta, v)$ instead of maps
$w : t \mapsto t\eta + v$, see the following section for details.

Finally, at each iteration $k$ of Algorithm 1, $w^{(k)} : t \mapsto t\eta^{(k)} + v^{(k)}$ is a super-invariant half-line
of $f^{(\sigma_{k+1})}$. Indeed, by construction of $\sigma_{k+1}$, and since $w^{(k)}$ is an invariant half-line of $f^{(\sigma_k)}$, we get

$$f^{(\sigma_{k+1})}(w^{(k)}(t)) = f(w^{(k)}(t)) \leq f^{(\sigma_k)}(w^{(k)}(t)) = w^{(k)}(t+1) , \tag{20}$$

for $t$ large enough. Moreover, since $w^{(k)}$ is an invariant half-line of $f^{(\sigma_k)}$, we have $\chi(f^{(\sigma_k)}) = \eta^{(k)}$.
Hence, in Step 5, $w^{(k)}$ is a super-invariant half-line of $f^{(\sigma_{k+1})}$ with slope $\eta^{(k)}$ equal to $\eta^{(k+1)} =
\chi(f^{(\sigma_{k+1})})$. By Corollary 1, there exists a unique invariant half-line of $f^{(\sigma_{k+1})}$ which coincides
with $w^{(k)}$ on the set of critical nodes of $f^{(\sigma_{k+1})}$, and it is given by $w^{(k+1)} = (f^{(\sigma_{k+1})})^\omega(w^{(k)}) :
t \mapsto t\eta^{(k+1)} + v^{(k+1)}$, with $v^{(k+1)} = \tilde g^\omega(v^{(k)})$, where $g := f^{(\sigma_{k+1})}$ and $\tilde g$ is as in Corollary 1.
Practical computations are detailed in the following sections.



4.2 The practical algorithm 

All the steps of Algorithm 1 involve equality tests or pointwise minimizations of half-lines. However,
it would not be robust to do these tests on half-lines just by choosing an arbitrary large number
$t$ in the equations and inequalities to be solved. We shall rather use the equivalence between the
representation of a half-line as a map $w : t \mapsto t\eta + v$ with $t$ large and that as a couple $(\eta, v)$. This
allows one to transform all the tests into systems of equations or optimizations of finite sets of
half-lines for the pointwise lexicographic order (which is linear, for each coordinate). This means
that we are solving the system of equations (14). Then, using the notations of Section 2, the
corresponding practical algorithm of the formal Algorithm 1 is given below in Algorithm 2.
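The pointwise lexicographic order on couples $(\eta, v)$ can be sketched as follows. This is a minimal illustration with hypothetical function names, representing a half-line by a pair of lists `(eta, v)` with exact (integer or rational) entries; Python compares tuples lexicographically, which is what each coordinate requires.

```python
def halfline_leq(w1, w2):
    """True if w1(t) <= w2(t) coordinatewise for all t large enough."""
    (eta1, v1), (eta2, v2) = w1, w2
    return all((e1, x1) <= (e2, x2)
               for e1, x1, e2, x2 in zip(eta1, v1, eta2, v2))

def halfline_min(lines):
    """Pointwise minimum, for t large, of a finite list of half-lines (eta, v)."""
    n = len(lines[0][0])
    pairs = [min((eta[i], v[i]) for eta, v in lines) for i in range(n)]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def evaluate(w, t):
    """Value of the half-line at a given t (only used to show why a fixed t is fragile)."""
    eta, v = w
    return [t * e + x for e, x in zip(eta, v)]

w1 = ([1, 2], [100, 0])
w2 = ([2, 2], [-50, 1])

print(halfline_leq(w1, w2))                 # w1 is below w2 for t large
print(evaluate(w1, 10), evaluate(w2, 10))   # at t = 10 the first coordinates are in the opposite order
print(halfline_min([([1, 3], [0, 0]), ([2, 1], [5, 7])]))
```

In the example, a test performed at $t = 10$ would conclude wrongly on the first coordinate, whereas the comparison of couples does not depend on any threshold for $t$.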

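Algorithm 2 (Policy iteration for multichain mean payoff two player games).

Input: A map $f$ the coordinates of which are of the form (19), and the notations (6,7) and (11-13).

Output: An invariant half-line $(\eta, v)$ of $f$ and an optimal policy $\sigma \in \mathcal{A}_M$.

1. Initialization: Set $k = 0$. Select an arbitrary strategy $\sigma_0 \in \mathcal{A}_M$. Compute the couple
$(\eta^{(0)}, v^{(0)})$ solution of

$$\begin{cases} \eta_i^{(0)} = \hat F(\eta^{(0)}; i, \sigma_0(i)) , \\ \eta_i^{(0)} + v_i^{(0)} = \hat F_{\eta^{(0)}}(v^{(0)}; i, \sigma_0(i)) , \end{cases} \quad \text{for all } i \in [n] . \tag{21}$$

2. If $\eta^{(k)}$ and $v^{(k)}$ satisfy System (14), or equivalently if $\sigma_{k+1} = \sigma_k$ is solution of (22) below,
then the algorithm stops and returns $(\eta^{(k)}, v^{(k)})$ and $\sigma_k$.

3. Otherwise, improve the policy $\sigma_k \in \mathcal{A}_M$ for $(\eta^{(k)}, v^{(k)})$ in a conservative way, that is choose
$\sigma_{k+1} \in \mathcal{A}_M$ such that

$$\begin{cases} \sigma_{k+1}(i) \in \operatorname{argmin}_{a\in A_{i,\eta^{(k)}}} \hat F_{\eta^{(k)}}(v^{(k)}; i, a) , \\ \sigma_{k+1}(i) = \sigma_k(i) \ \text{if } \sigma_k(i) \text{ is optimal,} \end{cases} \quad \text{for all } i \in [n] . \tag{22}$$

4. Compute a couple $(\eta^{(k+1)}, v')$ for policy $\sigma_{k+1}$ solution of

$$\begin{cases} \eta_i^{(k+1)} = \hat F(\eta^{(k+1)}; i, \sigma_{k+1}(i)) , \\ \eta_i^{(k+1)} + v'_i = \hat F_{\eta^{(k+1)}}(v'; i, \sigma_{k+1}(i)) , \end{cases} \quad \text{for all } i \in [n] . \tag{23}$$

If $\eta^{(k+1)} \neq \eta^{(k)}$, then set $v^{(k+1)} = v'$ and go to Step 6. Otherwise, the iteration is degenerate.

5.i) Let $g := f^{(\sigma_{k+1})}$ ($g_i = F(\cdot; i, \sigma_{k+1}(i))$). Compute $C(g)$, the set of critical nodes of the map
$\tilde g$ defined by $\tilde g := \hat g_{\eta^{(k+1)}}(\cdot) - \eta^{(k+1)}$, or equivalently:

$$\tilde g_i(v) = \hat F_{\eta^{(k+1)}}(v; i, \sigma_{k+1}(i)) - \eta_i^{(k+1)} , \quad \text{for all } i \in [n] ,$$

for which $v'$ is a harmonic vector.

5.ii) Compute $v^{(k+1)} = \tilde g^\omega(v^{(k)})$, that is the solution of:

$$\begin{cases} v_i^{(k+1)} = \hat F_{\eta^{(k+1)}}(v^{(k+1)}; i, \sigma_{k+1}(i)) - \eta_i^{(k+1)} , & i \in [n] \setminus C(g) , \\ v_i^{(k+1)} = v_i^{(k)} , & i \in C(g) . \end{cases} \tag{24}$$

6. Increment $k$ by one and go to Step 2.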

It remains to make precise how the steps are performed. Step 3 is just composed of lexicographic
optimization problems in finite sets. The systems (21) and (23) are the dynamic programming
equations of a one player multichain mean payoff game; they can be solved by applying the
policy iteration algorithm for multichain Markov decision processes with mean payoff introduced by
Howard [How60] and Denardo and Fox [DF68]. Note that one can also choose to solve Systems (21)
and (23) by applying Algorithm 2 to the maps $h = f^{(\sigma_0)}$ and $h = f^{(\sigma_{k+1})}$ respectively, while
replacing minimizations by maximizations, but in that case the algorithm is almost equivalent to
that of Howard [How60] and Denardo and Fox [DF68], see Section 5 below. In Step 5, the set of
critical nodes of $g$, that is that of $\tilde g$, can be computed using a variant of the algorithm proposed
in [AG03, §6.3], described in Section 5.3. Finally, System (24) is the dynamic programming equation
of an optimal control problem with infinite horizon stopped when reaching the set $C(g)$, which can
be solved using the original policy iteration algorithm of Howard [How60]. We shall recall all these
algorithms in Section 5.

4.3 Convergence of the algorithm 

In this subsection, we show in Theorem 7 that Algorithm 1, or equivalently Algorithm 2, terminates
after a finite number of steps. This result is proved using Theorem 3. Let us first show some
intermediate results.

The following lemma is known, see for instance Sorin [Sor04].

Lemma 4 (See [Sor04]). Let $g$ denote an order preserving self-map of $\mathbb{R}^n$, that is nonexpansive
in the sup-norm, and has a cycle time $\chi(g)$. If $w : t \mapsto t\eta + v$ is a super-invariant half-line of $g$,
then $\chi(g) \leq \eta$.

Proof. We reproduce the argument, for completeness: if $w : t \mapsto t\eta + v$ is a super-invariant half-line
of $g$, that is $g(w(t)) \leq w(t+1)$ for $t \geq t_0$ for some $t_0 \geq 0$, then $g^k(w(t)) \leq w(t+k)$, for all $k \geq 0$
and $t \geq t_0$, and so $\chi(g) \leq \lim_{k\to\infty} w(t_0 + k)/k = \eta$, which shows Lemma 4. □

Since, by (20), $w^{(k)}$ is a super-invariant half-line of $f^{(\sigma_{k+1})}$, with slope $\eta^{(k)} = \chi(f^{(\sigma_k)})$, it follows
from Lemma 4 that:

Lemma 5. The sequence of strategies defined in Algorithm 1 is such that

$$\chi(f^{(\sigma_{k+1})}) \leq \chi(f^{(\sigma_k)}) .$$

We now examine degenerate iterations.

Lemma 6. Let $(\sigma_k)_{k\geq 1}$ be the sequence of strategies defined in Algorithm 1 and assume that
$\chi(f^{(\sigma_{k+1})}) = \chi(f^{(\sigma_k)})$. Then, the following statements hold.

1. The half-line $w^{(k+1)}$ agrees with $w^{(k)}$ on the set of critical nodes of $f^{(\sigma_{k+1})}$.

2. Every critical node of $f^{(\sigma_{k+1})}$ is a critical node of $f^{(\sigma_k)}$.

3. $w^{(k+1)} \leq w^{(k)}$.

Proof. Let us use the notations: $g := f^{(\sigma_{k+1})}$ (as in Algorithm 2) and $h := f^{(\sigma_k)}$. By construction
and assumption, we have $\eta^{(k)} = \chi(h) = \chi(g) = \eta^{(k+1)}$, that we shall also denote by $\eta$.

Point 1: Since, by (20), $w^{(k)}$ is a super-invariant half-line of $g$, with slope $\eta^{(k)} = \chi(g)$, and since
$w^{(k+1)}$ is defined as $g^\omega(w^{(k)})$, the result follows from Corollary 1.

Point 2: Again, since $w^{(k)}$ is a super-invariant half-line of $g$, with slope $\eta^{(k)} = \chi(g)$, we deduce
from the definition of $\tilde g$ and (9), that

$$\tilde g(v^{(k)}) \leq v^{(k)} . \tag{25}$$

Then by Lemma 2, $\tilde g(v^{(k)})$ agrees with $v^{(k)}$ on $C(g) = C(\tilde g)$, the set of critical nodes of $\tilde g$, and so,
the equality $g(w^{(k)}(t)) = w^{(k)}(t+1)$ holds on $C(g)$ for $t$ large. Since $w^{(k)}$ is an invariant half-line of
$f^{(\sigma_k)}$, we get that $f_i(w^{(k)}(t)) = f_i^{(\sigma_{k+1})}(w^{(k)}(t)) = w_i^{(k)}(t+1) = f_i^{(\sigma_k)}(w^{(k)}(t))$ for $t$ large enough
and $i \in C(g)$. Hence, the conservative selection rule ensures that $\sigma_{k+1}(i) = \sigma_k(i)$ for all $i \in C(g)$.
This implies that $g_i = h_i$ for all $i \in C(g)$, and since $\chi(g) = \chi(h)$, we get from the definitions of $\tilde g$
and $\tilde h$ that

$$\tilde g_i = \tilde h_i \quad \text{for all } i \in C(g) . \tag{26}$$

Observe that $v^{(k+1)}$ is a fixed point of $\tilde g$, and that $\tilde g$ is a polyhedral additively homogeneous order
preserving convex self-map of $\mathbb{R}^n$. Hence the critical nodes of $g$ are the indices that belong to a
final class of a matrix $M \in \partial \tilde g(v^{(k+1)})$ (since the elements of $\partial \tilde g(v^{(k+1)})$ are stochastic matrices,
all their final classes are recurrent). Let $F$ be such a final class. From (15), the row $M_{i\cdot} \in
\partial \tilde g_i(v^{(k+1)})$ for $i \in F$, that is $\tilde g_i(v) - \tilde g_i(v^{(k+1)}) \geq M_{i\cdot}(v - v^{(k+1)})$ for all $v \in \mathbb{R}^n$. Since $v^{(k+1)}$
is a fixed point of $\tilde g$, $v^{(k)}$ a fixed point of $\tilde h$, and $v^{(k+1)}$ agrees with $v^{(k)}$ on $C(g)$ (from Point 1),
we get that $\tilde g_i(v^{(k+1)}) = v_i^{(k+1)} = v_i^{(k)} = \tilde h_i(v^{(k)})$ for all $i \in C(g)$. From (26), we deduce that
$\tilde g_i(v) - \tilde g_i(v^{(k+1)}) = \tilde h_i(v) - \tilde h_i(v^{(k)})$ for all $i \in C(g)$ and $v \in \mathbb{R}^n$. Now, since $F$ is a final class of
$M$, hence $F \subset C(g)$, and $M_{ij} = 0$ for $i \in F$ and $j \notin C(g)$, we get that $M_{i\cdot} v^{(k+1)} = M_{i\cdot} v^{(k)}$ for
$i \in F$. This implies that $\tilde h_i(v) - \tilde h_i(v^{(k)}) \geq M_{i\cdot}(v - v^{(k)})$ for all $v \in \mathbb{R}^n$ and $i \in F$, which shows that
$M_{i\cdot} \in \partial \tilde h_i(v^{(k)})$ for $i \in F$. Let $N := [n] \setminus F$ and define the matrix $Q$ such that $Q_{i\cdot} = M_{i\cdot}$ if $i \in F$,
and $Q_{i\cdot}$ be any element of $\partial \tilde h_i(v^{(k)})$ if $i \in N$, then $Q \in \partial \tilde h(v^{(k)})$. Hence, the $F \times F$ submatrix of
$M$ is also the $F \times F$ submatrix of $Q$, and so $F$ is a final class of $Q$. Since $v^{(k)}$ is a fixed point of $\tilde h$,
this implies that $F$ is included in the set of critical nodes of $\tilde h$, which is also by definition the set
of critical nodes of $h$. This shows that all critical nodes of $g$ are also critical nodes of $h$, and shows
Point 2.

Point 3: From (25), we get that $\tilde g(v^{(k)}) \leq v^{(k)}$, hence the sequence $(\tilde g^m(v^{(k)}))_{m\geq 0}$ is nonincreasing and
$\tilde g^\omega(v^{(k)}) \leq v^{(k)}$. Since $\eta^{(k)} = \eta^{(k+1)}$, we get that $w^{(k+1)} = g^\omega(w^{(k)}) = t\eta^{(k)} + \tilde g^\omega(v^{(k)}) \leq w^{(k)}$. □

Finally, we prove that the algorithm terminates.

Theorem 7. A strategy cannot be selected twice in Algorithm 1 and so, the algorithm terminates
after a finite number of iterations.

Proof. Assume by contradiction that the same strategy is selected twice in Algorithm 1, that is
$\sigma_s = \sigma_m$ for some iterations $1 \leq s < m$ of the algorithm before it stops. Then, $\chi(f^{(\sigma_s)}) = \chi(f^{(\sigma_m)})$
and since by Lemma 5, $\chi(f^{(\sigma_s)}) \geq \chi(f^{(\sigma_{s+1})}) \geq \cdots \geq \chi(f^{(\sigma_m)})$, we get the equality $\chi(f^{(\sigma_s)}) =
\chi(f^{(\sigma_{s+1})}) = \cdots = \chi(f^{(\sigma_m)})$. Hence, by Lemma 6, Part 2, we have that $C(f^{(\sigma_m)}) \subset C(f^{(\sigma_{m-1})}) \subset
\cdots \subset C(f^{(\sigma_s)})$ and since $\sigma_s = \sigma_m$, we get the equality $C(f^{(\sigma_m)}) = C(f^{(\sigma_{m-1})}) = \cdots = C(f^{(\sigma_s)})$.
So by Lemma 6, Part 1, $w^{(s)}$ and $w^{(m)}$ are both invariant half-lines of $f^{(\sigma_s)}$ with slope $\chi(f^{(\sigma_s)})$,
that agree on $C(f^{(\sigma_s)})$. Hence by Corollary 1, $w^{(s)} = w^{(m)}$. Since by Lemma 6, Part 3, we have
$w^{(s)} \geq w^{(s+1)} \geq \cdots \geq w^{(m)}$, it follows that $w^{(s)} = \cdots = w^{(m)}$. In particular, $w^{(s)} = w^{(s+1)}$.

Hence, $w^{(s)}(t+1) = w^{(s+1)}(t+1) = f^{(\sigma_{s+1})} \circ w^{(s+1)}(t) = f^{(\sigma_{s+1})} \circ w^{(s)}(t) = f \circ w^{(s)}(t)$ for $t$ large
enough. It follows that $w^{(s)}$ is an invariant half-line of $f$, and so, the algorithm stops at step $s$,
which contradicts the existence of iteration $m$, and so the same strategy cannot be selected twice
in Algorithm 1.

Since the sets $A_i$ are finite, the number of strategies (the elements of $\mathcal{A}_M$) is also finite, and
since a strategy cannot be selected twice, Algorithm 1 stops after a finite number of iterations,
that is bounded by the number of strategies. □



5 Ingredients of Algorithm 1 or 2: one player games algorithms



As said in Section 4.2, each basic step of the policy iteration algorithm for multichain mean payoff
zero-sum two player games (Algorithm 1 or 2) concerns the solution of one player games, also called
stochastic control problems or Markov decision processes, with finite state and action spaces:
a mean payoff problem for Systems (21) and (23), an infinite horizon problem stopped at the
boundary for System (24), and the set of critical nodes of the corresponding dynamic programming
operator in Step 5. We recall here the policy iteration algorithm for solving stochastic control
problems, with either infinite horizon or mean payoff, and the algorithm proposed in [AG03, §6.3]
for computing a critical graph, and explain how all these algorithms are applied in Algorithm 1 or 2.
By doing so, we shall also see that the classical Howard / Denardo-Fox algorithm can be thought
of as a special case of these algorithms, in which the second player has no choices of actions.

In all the section, we consider the following dynamic programming or Shapley operator of a one
player game with finite state and action spaces: $g$ is a map from $\mathbb{R}^n$ to itself, given by:

$$[g(v)]_i := \max_{b\in B_i} G(v; i, b) , \quad \forall i \in [n], \ v \in \mathbb{R}^n , \tag{27}$$

where

$$G(v; i, b) := \sum_{j\in[n]} P^b_{ij}\,v_j + r^b_i , \tag{28}$$

the vectors $P^b_{i\cdot}$ are substochastic vectors, for all $i \in [n]$ and $b \in B_i$, and $B_i$ are finite sets, for all
$i \in [n]$. Equivalently, $g$ is a convex additively subhomogeneous order preserving polyhedral self-map
of $\mathbb{R}^n$.

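Since player MIN does not exist, the set of feedback strategies for player MAX, $\mathcal{B}_M$, is given by
$\mathcal{B}_M := \{ \delta : [n] \to B \mid \delta(i) \in B_i, \ \forall i \in [n] \}$, where $B$ contains all the sets $B_i$. For each $\delta \in \mathcal{B}_M$, we
denote by $g^{(\delta)}$ the self-map of $\mathbb{R}^n$ given by:

$$g_i^{(\delta)}(v) := G(v; i, \delta(i)) , \quad \forall i \in [n], \ v \in \mathbb{R}^n .$$

We also denote by $r^{(\delta)}$ the vector of $\mathbb{R}^n$ such that $r_i^{(\delta)} = r_i^{\delta(i)}$ and by $P^{(\delta)}$ the $n \times n$ matrix such that
$P_{ij}^{(\delta)} = P_{ij}^{\delta(i)}$; then $g^{(\delta)} : v \mapsto P^{(\delta)} v + r^{(\delta)}$.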



5.1 Policy iterations for one player games with discounted payoff 

System (24) consists in finding the solution $v$ of the equation $v = \tilde g(v)$ with $v = u$ on $C(g)$, where
$u \in \mathbb{R}^n$ is super-harmonic with respect to $\tilde g$, $\tilde g(u) \leq u$, and $\tilde g$ is as in (27) with (28). The solution
$v$ is thus the value of a one player game with infinite horizon stopped when reaching the set $C(g)$,
whose transition probabilities are given by the $P^b_{ij}$, instantaneous reward is given by the $r^b_i$ and
final reward is given by $u_i$, when the game is in state $i \in C(g)$. This value function can be
obtained using the classical policy iteration algorithm of Howard [How60] for a one player game.
From Theorem 3, $v$ is solution of the above equation, if and only if $v_C = u_C$ and $v_N$ is a fixed
point of the convex polyhedral additively subhomogeneous order preserving self-map $h$ of $\mathbb{R}^N$, with
$C = C(g)$, $N = [n] \setminus C$, and $h$ defined as in Theorem 3, Point (iii), with $g$ replaced by $\tilde g$. One can
also consider the equivalent equation $v = h(v)$ with $h_i = \tilde g_i$ for $i \in N$ and $h_i(v) = u_i$ for $i \in C$
and $v \in \mathbb{R}^n$. In that case, $h$ is a convex polyhedral additively subhomogeneous order preserving
self-map of $\mathbb{R}^n$.

In these two settings, we need to solve an equation of the form $v = g(v)$, where $g$ is of the
form (27), and $g$ has no critical node: $C(g) = \emptyset$. From [AG03, Corollary 1.3], $g$ has a unique fixed
point and all the maps $g^{(\delta)}$ with $\delta \in \mathcal{B}_M$ have a unique fixed point (since their critical nodes are
necessarily critical nodes of $g$). The policy iteration algorithm of Howard applied to this equation
is then given by Algorithm 3.

Algorithm 3 (Policy iteration of Howard [How60] for stochastic control problems).

Input: A map $g$ of the form (27) with no critical node.
Output: The fixed point of $g$ and an optimal policy $\delta \in \mathcal{B}_M$.

1. Initialization: Set $k = 0$. Select an arbitrary strategy $\delta_0 \in \mathcal{B}_M$.

2. Compute the value of the game with fixed feedback strategy $\delta_k$, that is the solution $v^{(k)}$ of
the linear system:

$$v^{(k)} = g^{(\delta_k)}(v^{(k)}) .$$

3. If $v^{(k)} = g(v^{(k)})$, or equivalently if $\delta_{k+1} = \delta_k$ is solution of (29) below, then the algorithm
stops and returns $v^{(k)}$ and $\delta_k$.

4. Otherwise, improve the policy, that is choose $\delta_{k+1} \in \mathcal{B}_M$ for the value $v^{(k)}$ such that:

$$\delta_{k+1}(i) \in \operatorname{argmax}_{b\in B_i} G(v^{(k)}; i, b) , \quad \forall i \in [n] . \tag{29}$$

5. Increment $k$ by one and go to Step 2.

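It is known [How60] that the sequence of values is monotone, $v^{(k+1)} \geq v^{(k)}$ (the policy is improved by
maximization in (29)), and that the algorithm stops after a finite number of steps.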
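The steps of Algorithm 3 can be sketched as follows. This is a minimal illustration under assumptions, not a reference implementation: the data layout `actions[i] = [(P_row, r), ...]`, the function names, and the two-state example are ours, exact rational arithmetic is used so that the stopping test is an exact equality, and the improvement step keeps the current action whenever it is still optimal.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A square and nonsingular)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [x - M[r][c] * y for x, y in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def howard(actions):
    """Policy iteration for v = max_b (P^b v + r^b), assuming no critical node."""
    n = len(actions)
    policy = [0] * n
    while True:
        # Step 2: value of the current policy, solution of (I - P) v = r.
        A = [[(1 if i == j else 0) - actions[i][policy[i]][0][j] for j in range(n)]
             for i in range(n)]
        b = [actions[i][policy[i]][1] for i in range(n)]
        v = solve(A, b)
        # Steps 3-4: improvement (29), keeping the current action if it is optimal.
        new_policy = []
        for i in range(n):
            vals = [sum(p * x for p, x in zip(P, v)) + r for P, r in actions[i]]
            best = max(vals)
            new_policy.append(policy[i] if vals[policy[i]] == best else vals.index(best))
        if new_policy == policy:
            return v, policy
        policy = new_policy

half = Fraction(1, 2)
# Hypothetical example: every row sums to 1/2, so the game stops with probability 1/2 at each step.
ACTIONS = [
    [([half, 0], 1), ([0, half], half)],  # state 0: stay (reward 1) or move to state 1 (reward 1/2)
    [([0, half], 2)],                     # state 1: stay (reward 2)
]
value, policy = howard(ACTIONS)
print(value, policy)
```

On this example the first policy evaluates to $(2, 4)$, the improvement step switches state 0 to its second action, and the second evaluation $(5/2, 4)$ is a fixed point of $g$, so the values increase along the iterations.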



5.2 Policy iteration for multichain one player games 



Consider a one player game with dynamic programming operator $g$ given by (27) and mean payoff.
Then, as explained in Section 2 in the more general two player case, the mean payoff of the game
is the slope $\eta$ of any invariant half line $(\eta, v)$ of $g$, which is also any solution of the following coupled
system (see Equation (14)):

\[
\begin{cases}
\eta = \hat{g}(\eta) \\
\eta + v = \tilde{g}_\eta(v) ,
\end{cases}
\tag{30}
\]

where $\hat{g}$ and $\tilde{g}_\eta$ are defined in (10) and (11) respectively. In the present one player case, they are
reduced to:

\[
[\hat{g}(\eta)]_i := \max_{b \in B_i} \hat{G}(\eta; i, b)
\quad \text{and} \quad
[\tilde{g}_\eta(v)]_i := \max_{b \in B_{i,\eta}} G(v; i, b) ,
\tag{31}
\]

with

\[
\hat{G}(\eta; i, b) = \sum_{j \in [n]} P^b_{ij} \eta_j
\quad \text{and} \quad
B_{i,\eta} := \operatorname*{argmax}_{b \in B_i} \Bigl\{ \sum_{j \in [n]} P^b_{ij} \eta_j \Bigr\}
\tag{32}
\]

for all $\eta, v \in \mathbb{R}^n$, $i \in [n]$, $b \in B_i$. We refer also to [DF68, Put94] for the existence of solutions to
System (30), and for the proof that $\eta$ solution of this system is the mean payoff of the game in this
one player context. The following algorithm for multichain mean payoff Markov decision processes
was introduced by Howard [How60] and proved to converge by Denardo and Fox [DF68]:
Algorithm 4 (Policy iteration algorithm for multichain mean payoff one player games).
Input: A map $g$ of the form (27) with (28), and the notations (31, 32).
Output: An invariant half-line $(\eta, v)$ of $g$ and an optimal policy $\delta \in \mathcal{B}_M$.

1. Initialization: Set $k = 0$. Select an arbitrary strategy $\delta_0 \in \mathcal{B}_M$.

2. For each final class $F$ of $P^{(\delta_k)}$, denote by $i_F$ the minimal index of the elements of $F$, and
define $S$ as the set of all these indices $i_F$. Compute the couple $(\eta^{(k)}, v^{(k)})$ for policy $\delta_k$
solution of

\[
\begin{cases}
\eta_i^{(k)} = \hat{G}(\eta^{(k)}; i, \delta_k(i)) & i \in [n] \setminus S \\
\eta_i^{(k)} + v_i^{(k)} = G(v^{(k)}; i, \delta_k(i)) & i \in [n] \\
v_i^{(k)} = 0 & i \in S .
\end{cases}
\tag{33}
\]

3. If $(\eta^{(k)}, v^{(k)})$ is solution of (30), or equivalently if $\delta_{k+1} = \delta_k$ is solution of (34) below, then
the algorithm stops and returns $(\eta^{(k)}, v^{(k)})$ and $\delta_k$.

4. Otherwise, improve the policy $\delta_{k+1} \in \mathcal{B}_M$ for $(\eta^{(k)}, v^{(k)})$ in a conservative way, that is choose
$\delta_{k+1} \in \mathcal{B}_M$ such that:

\[
\begin{cases}
\delta_{k+1}(i) \in \operatorname*{argmax}_{b \in B_{i,\eta^{(k)}}} G(v^{(k)}; i, b) \\
\delta_{k+1}(i) = \delta_k(i) \ \text{if } \delta_k(i) \text{ is optimal,}
\end{cases}
\quad \text{for all } i \in [n] .
\tag{34}
\]

5. Increment $k$ by one and go to Step 2.

The justifications and details of Algorithm 4 can be found in [DF68, Put94] and are recalled in
the Appendix. Solving System (33) turns out to be a critical step. This can be optimized by exploiting
the structure of the system; we discuss this issue in the Appendix. As explained in Section 4.2, another
way to solve a multichain mean payoff Markov decision process may be to use Algorithm 1 or 2 in
the particular case of a one-player game, with maximizations instead of minimizations. In order to
compare it with Algorithm 4, we rewrite below Algorithm 2 in that case, with the above notations.
Note that in the one-player case, the map g of Step 5 of Algorithm 2 is affine, hence its critical
graph reduces to the final graph of its tangent matrix.

Algorithm 5 (Specialization of Algorithm 2 to the one player case).
Input: A map $g$ of the form (27) with (28), and the notations (31, 32).
Output: An invariant half-line $(\eta, v)$ of $g$ and an optimal policy $\delta \in \mathcal{B}_M$.

1. Initialization: Set $k = 0$. Select an arbitrary strategy $\delta_0 \in \mathcal{B}_M$. Compute the couple $(\eta^{(0)}, v^{(0)})$
solution of

\[
\begin{cases}
\eta_i^{(0)} = \hat{G}(\eta^{(0)}; i, \delta_0(i)) \\
\eta_i^{(0)} + v_i^{(0)} = G(v^{(0)}; i, \delta_0(i))
\end{cases}
\quad \text{for all } i \in [n] .
\tag{35}
\]

2. If $\eta^{(k)}$ and $v^{(k)}$ satisfy System (30), or equivalently if $\delta_{k+1} = \delta_k$ is solution of (36) below,
then the algorithm stops and returns $(\eta^{(k)}, v^{(k)})$ and $\delta_k$.

3. Otherwise, improve the policy $\delta_{k+1} \in \mathcal{B}_M$ for $(\eta^{(k)}, v^{(k)})$ in a conservative way, that is choose
$\delta_{k+1} \in \mathcal{B}_M$ such that

\[
\begin{cases}
\delta_{k+1}(i) \in \operatorname*{argmax}_{b \in B_{i,\eta^{(k)}}} G(v^{(k)}; i, b) \\
\delta_{k+1}(i) = \delta_k(i) \ \text{if } \delta_k(i) \text{ is optimal,}
\end{cases}
\quad \text{for all } i \in [n] .
\tag{36}
\]

4. Compute a couple $(\eta^{(k+1)}, v')$ for policy $\delta_{k+1}$ solution of

\[
\begin{cases}
\eta_i^{(k+1)} = \hat{G}(\eta^{(k+1)}; i, \delta_{k+1}(i)) \\
\eta_i^{(k+1)} + v'_i = G(v'; i, \delta_{k+1}(i))
\end{cases}
\quad \text{for all } i \in [n] .
\tag{37}
\]

5. If $\eta^{(k+1)} \ne \eta^{(k)}$ then set $v^{(k+1)} = v'$ and go to Step 6. Otherwise, the iteration is degenerate.

5.i) Compute $C$ the set of final nodes of the matrix $P^{(\delta_{k+1})}$.
5.ii) Compute $v^{(k+1)}$ the solution of:

\[
\begin{cases}
v_i^{(k+1)} = G(v^{(k+1)}; i, \delta_{k+1}(i)) - \eta_i^{(k+1)} & i \in [n] \setminus C \\
v_i^{(k+1)} = v_i^{(k)} & i \in C .
\end{cases}
\tag{38}
\]

6. Increment $k$ by one and go to Step 2.



Systems (35) and (37) are of the form

\[
\begin{cases}
\eta = P\eta \\
\eta + v = Pv + r
\end{cases}
\tag{39}
\]

where $r = r^{(\delta)} \in \mathbb{R}^n$ and $P = P^{(\delta)}$ is a stochastic matrix, with $\delta = \delta_0$ or $\delta_{k+1}$. It can be shown
that the solution $\eta$ of such a system is unique, that one can eliminate for each final class $F$ of $P$ one
of the equations $\eta_i = (P\eta)_i$ with index $i \in F$, and that $v$ is defined up to an element of the kernel
of $I - P$, the dimension of which is equal to the number of final classes of $P$. When this number
is strictly greater than one, and $v^{(k+1)}$ is chosen to be any solution $v'$ of (37) in Algorithm 5, the
algorithm may cycle, see Section 6 for an example in the two player case. One way to handle
this [DF68, Put94] is either to fix to zero the value of $\mu_F v$ for each invariant measure $\mu_F$ of $P$
with support in a final class $F$ of $P$, or to fix to zero the components of $v$ with indices in some set
$S$ containing exactly one node of each final class of $P$. In these two cases, the solution $v$ of (39)
becomes unique. Moreover, if in Algorithm 5, (37) is combined with either the conditions $\mu_F v' = 0$
or the conditions $v'_S = 0$ with $S$ chosen in a conservative way, that is such that the same index
is chosen in $F$ for iterations $k$ and $k + 1$, if $F$ is a final class of $P^{(\delta_{k+1})}$ which is also a final class
of $P^{(\delta_k)}$, then $v' = v^{(k)}$ on the set of final nodes of $P^{(\delta_{k+1})}$ when $\eta^{(k+1)} = \eta^{(k)}$, which implies
that $v' = v^{(k+1)}$, hence Step 5 of Algorithm 5 becomes useless. This shows that Algorithm 4 is
equivalent to Algorithm 5, where (37) is combined with the conditions $v'_S = 0$, where $S$ is the set of
minimal indices of each final class of $P^{(\delta_{k+1})}$. In other words, Algorithm 4 is a particular realization
of Algorithm 5 where one chooses one special solution $v' = v^{(k+1)}$ of (37) at each iteration of the
algorithm, even when $\eta^{(k+1)} \ne \eta^{(k)}$. Denardo and Fox proved [DF68, Put94] that the sequence of
couples $(\eta^{(k)}, v^{(k)})_{k \ge 1}$ of Algorithm 4 is non decreasing in a lexicographical order, meaning that
$\eta^{(k+1)} \ge \eta^{(k)}$, and $v^{(k+1)} \ge v^{(k)}$ when $\eta^{(k+1)} = \eta^{(k)}$, and that Algorithm 4 stops after a finite
number of iterations (when the sets of actions are finite). Indeed, the convergence of Algorithm 1
proved in Section 4.3 shows that this also holds for the slightly more general Algorithm 5.
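To make the non-uniqueness of $v$ in (39) concrete, here is a small worked instance. The three-state chain and the rewards are chosen purely for illustration and do not come from the experiments: states 1 and 2 are absorbing with rewards 1 and 3, and state 3 moves to state 1 or 2 with probability 1/2 and has reward 0.

```latex
\[
P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \tfrac12 & \tfrac12 & 0 \end{pmatrix},
\qquad
r = \begin{pmatrix} 1 \\ 3 \\ 0 \end{pmatrix}.
\]
% Rows 1 and 2 of  \eta + v = P v + r  give  \eta_1 = 1  and  \eta_2 = 3.
% Row 3 of  \eta = P \eta  gives  \eta_3 = (\eta_1 + \eta_2)/2 = 2.
% Row 3 of  \eta + v = P v + r  gives  v_3 = (v_1 + v_2)/2 - 2.
\[
\eta = (1,\, 3,\, 2)^T,
\qquad
v = \bigl(v_1,\; v_2,\; \tfrac12 (v_1 + v_2) - 2 \bigr)^T
\quad \text{with } v_1, v_2 \in \mathbb{R} \text{ free}.
\]
```

Here $P$ has two final classes, $\{1\}$ and $\{2\}$, and the kernel of $I - P$ has dimension two, so $v$ keeps two degrees of freedom. Imposing $v_S = 0$ with $S = \{1, 2\}$ selects the single solution $v = (0, 0, -2)^T$.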



5.3 Critical graph 

When a degenerate iteration ($\eta^{(k+1)} = \eta^{(k)}$) occurs in Step 4 of Algorithm 1, one has to compute
the critical nodes of the map derived in that step from $g := f^{(\sigma_{k+1})}$. This can be done by applying the techniques
of [AG03, § 6.3], leading to Algorithm 6 below. More precisely, one applies first the following
steps to the map g and its harmonic vector $v'$, then applies Algorithm 6.



Consider an additively homogeneous map $g$ whose coordinates are defined as in (27) with (28),
and $u$ a harmonic vector of $g$. For any set $\mathcal{P}$ of stochastic matrices, we define $G^f(\mathcal{P})$ as the union
of the graphs of the matrices $M_{FF}$, where $M \in \mathcal{P}$ and $F$ is a final class of $M$. Define

\[
B_i^u = \{ b \in B_i \mid G(u; i, b) = u_i \}
\quad \text{and} \quad
\mathcal{P}_i = \{ P_i^b \mid b \in B_i^u \} .
\tag{40}
\]

Then, the critical graph of $g$ is given by

\[
G^c(g) = G^f(\partial g(u)), \quad \text{where } \partial g(u) = \operatorname{co}(\mathcal{P}_1) \times \cdots \times \operatorname{co}(\mathcal{P}_n) ,
\tag{41}
\]

and $\operatorname{co}(\cdot)$ denotes the convex hull of a set. The following algorithm computes the graph in (41) for
a general family $\{\mathcal{P}_i\}_{i \in [n]}$, where $\mathcal{P}_i \subset \mathbb{R}^n$ is a nonempty finite set of stochastic vectors. Note that
any such family $\{\mathcal{P}_i\}_{i \in [n]}$ corresponds to the map $g : \mathbb{R}^n \to \mathbb{R}^n$ such that

\[
g_i(v) = \max_{p \in \mathcal{P}_i} p\,v \quad \text{for all } i \in [n] ,
\tag{42}
\]

which has $u = 0$ as a harmonic vector, and is of the above form. Hence the algorithm below
corresponds also to the computation of the critical graph of this map $g$.

Before writing the algorithm, we recall some definitions of graph theory (see for instance [CLRS01]).
We define a graph $G := (V, E)$ as a finite set of vertices (or nodes) $V$ and a set of edges (or
arcs) $E \subset \{(i, j) \mid i, j \in V\}$. A path of length $l \ge 0$ is a sequence $(i_0, \ldots, i_l)$ such that
$i_k \in V$ for $k \in \{0, \ldots, l\}$ and $(i_k, i_{k+1}) \in E$ for $k < l$. A strongly connected component of $G$
is the restriction $G|_{V'}$ of $G$ to some subset of nodes $V' \subset V$, that is the graph $(V', E')$ with
$E' := \{(i, j) \in E \mid i, j \in V'\}$, where $V'$ is a maximal subset such that there exists a path from each node $i \in V'$ to
every node $j \in V'$. A strongly connected component $G'$ is called trivial if it consists of exactly
one node and no arcs. We define a final class of $G = (V, E)$ as a non trivial strongly connected
component $G' = (V', E')$ of $G$ such that there exists no arc $(i, j) \in E$ with $i \in V'$ and $j \in V \setminus V'$.
Note that the strongly connected components of a graph can be found using Tarjan's algorithm,
see [CLRS01].
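The definition of a final class can be illustrated by a short Python sketch. For brevity it uses plain reachability searches rather than Tarjan's algorithm, so it is quadratic in the worst case and only meant for small graphs; the function names `reachable` and `final_classes` are hypothetical.

```python
def reachable(succ, start):
    """Nodes reachable from `start` by a path with at least one arc."""
    seen, stack = set(), list(succ.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return seen


def final_classes(nodes, arcs):
    """Return the node sets of the final classes of the graph (nodes, arcs)."""
    succ = {i: set() for i in nodes}
    for i, j in arcs:
        succ[i].add(j)
    reach = {i: reachable(succ, i) for i in nodes}
    classes = []
    for i in sorted(nodes):
        # i lies in a final class iff it is on a cycle (non trivial component)
        # and every node reachable from i can reach i back (no arc leaves the class).
        if i in reach[i] and all(i in reach[j] for j in reach[i]):
            cls = frozenset(reach[i])
            if cls not in classes:
                classes.append(cls)
    return classes


# A 5-node graph: {1, 2} and {3, 4} are closed, node 5 has a loop but leaks to 4.
arcs = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3), (3, 4), (4, 3), (4, 4), (5, 4), (5, 5)]
classes = final_classes([1, 2, 3, 4, 5], arcs)
```

Node 5 is in a non trivial strongly connected component because of its loop, but the arc (5, 4) leaves that component, so it is not final.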

Algorithm 6 (Algorithm to compute the critical graph, compare with [AG03, § 6.3]).

Input: $(\mathcal{P}_1, \ldots, \mathcal{P}_n)$ where $\mathcal{P}_i \subset \mathbb{R}^n$ is a finite set of stochastic vectors for $i \in [n]$.

Output: A graph depending on $\mathcal{P}_1, \ldots, \mathcal{P}_n$, equal to $G^f(\operatorname{co}(\mathcal{P}_1) \times \cdots \times \operatorname{co}(\mathcal{P}_n))$ if all the $\mathcal{P}_i$ are
nonempty; and its set of nodes.

1. Set $F^{(0)} = \emptyset$, $I^{(0)} = [n]$, $G^{(0)} = \emptyset$, $Q_i^{(0)} = \mathcal{P}_i$ for $i \in [n]$, and $k = 0$.

2. If all the sets $\{Q_i^{(k)}\}_{i \in I^{(k)}}$ are empty, then the algorithm stops and returns $G^{(k)}$ and $F^{(k)}$.

3. Otherwise, build the graph $G = (I^{(k)}, E)$ with set of nodes $I^{(k)}$, and set of arcs
$E = \{(i, j) \in I^{(k)} \times I^{(k)} \mid p_j \ne 0 \text{ for some } p \in Q_i^{(k)}\}$. Set $F$ as the union of final classes of $G$.

4. Put $I^{(k+1)} = I^{(k)} \setminus F$ and $F^{(k+1)} = F^{(k)} \cup F$.

5. Set $G^{(k+1)} = G^{(k)} \cup G|_F$ where $G|_F$ denotes the restriction of $G$ to $F$.

6. For all $i \in I^{(k+1)}$, define the sets $Q_i^{(k+1)} \subset \mathbb{R}^{I^{(k+1)}}$ of row vectors obtained by restricting to
$I^{(k+1)}$ the vectors $p \in Q_i^{(k)}$ such that $\sum_{j \in I^{(k+1)}} p_j = 1$.

7. Increment $k$ by one, and go to Step 2.

The convergence (after at most $n$ iterations) of this algorithm follows from variants of Lemmas
4.7 and 4.9 of [AG03], applied to the maps $g_k$ constructed by (42) from the families $\{Q_i^{(k)}\}_{i \in I^{(k)}}$.
Indeed, if all the $Q_i^{(k)}$ with $i \in I^{(k)}$ are nonempty, the map $g_k$ is a map from $\mathbb{R}^{I^{(k)}}$ to itself and
Lemma 4.7 says that $g_k$ has at least one invariant critical class, which implies that the set $F$ of
Step 3 is nonempty. Moreover, Lemma 4.9 says that, if all the $Q_i^{(k+1)}$ with $i \in I^{(k+1)}$ are nonempty,
the critical graph of $g_k$ is equal to the union of $G|_F$ with the critical graph of the map $g_{k+1}$.

In order to generalize these arguments, one needs to extend the notion of critical graph to the
case of a map $g$ from $(\mathbb{R} \cup \{-\infty\})^n$ to itself, of the form (42) with general families $\{\mathcal{P}_i\}_{i \in [n]}$ of
(possibly empty) finite sets of stochastic vectors (or of the form (27) with (28), with a harmonic
vector $u \in (\mathbb{R} \cup \{-\infty\})^n$). For instance, define the critical graph of $g$ as the restriction to the set
of nodes $i \in [n]$ such that $\mathcal{P}_i$ is nonempty (or $u_i \ne -\infty$) of the critical graph of $g \vee \mathrm{id}$, where $\mathrm{id}$
is the identity map and $\vee$ denotes the supremum operation. Then, the identically $-\infty$ map has
no critical class, any map $g$ which is not identically $-\infty$ has an invariant critical class, and the
above recurrence formula for critical graphs is true even if $g_{k+1}$ takes $-\infty$ values. This shows that
Algorithm 6 computes the critical graph of the map $g$ associated to the family $\{\mathcal{P}_i\}_{i \in [n]}$, even if
some of the sets of the family are empty.

Note that since Tarjan's algorithm has a linear complexity in the number of arcs of a graph, the
complexity of the above algorithm is at most in the order of $nm$, where $m$ is the sum of the number
of arcs of all the elements of $\mathcal{P}_i$, $i \in [n]$. This is comparable with the complexity of solving the
linear systems of the form (33) by LU solvers, hence with the other steps of Algorithm 2.
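The loop of Algorithm 6 can be sketched in Python as follows. This is a hedged illustration, not the PIGAMES implementation: it finds final classes by plain reachability instead of Tarjan's algorithm, it keeps the vectors at full length and tests their mass on the active index set instead of physically restricting them, and, as an explicit assumption of this sketch, it discards from the active set the nodes whose candidate set has become empty (in the spirit of the $g \vee \mathrm{id}$ convention discussed above), so that every pass removes at least one node. The names `final_nodes` and `critical_graph` are hypothetical, and nodes are numbered from 0.

```python
def final_nodes(nodes, succ):
    """Union of the final classes of the graph given by successor sets."""
    def reach(start):
        seen, stack = set(), list(succ[start])
        while stack:
            j = stack.pop()
            if j not in seen:
                seen.add(j)
                stack.extend(succ[j])
        return seen
    r = {i: reach(i) for i in nodes}
    return {i for i in nodes if i in r[i] and all(i in r[j] for j in r[i])}


def critical_graph(families, tol=1e-12):
    """families[i]: list of stochastic vectors (tuples of length n) for node i.
    Returns (arcs, nodes) of the computed graph."""
    n = len(families)
    active = set(range(n))                                  # I^(k)
    q = {i: [tuple(p) for p in families[i]] for i in active}  # Q_i^(k)
    arcs, crit_nodes = set(), set()                         # G^(k), F^(k)
    while any(q[i] for i in active):                        # Step 2
        # Step 3: arcs (i, j) with p_j != 0 for some p in Q_i, then final classes.
        succ = {i: {j for p in q[i] for j in active if p[j] > tol} for i in active}
        f = final_nodes(active, succ)
        arcs |= {(i, j) for i in f for j in succ[i]}        # Step 5
        crit_nodes |= f                                     # Step 4
        # Assumption of this sketch: exhausted nodes are dropped as well.
        empty = {i for i in active if not q[i]}
        active -= f | empty
        # Step 6: keep the vectors whose whole mass stays on the active set.
        q = {i: [p for p in q[i]
                 if abs(sum(p[j] for j in active) - 1.0) <= tol]
             for i in active}
    return arcs, crit_nodes


# Five singleton families: two closed pairs and one node leaking into a pair.
families = [
    [(0.5, 0.5, 0, 0, 0)], [(0.5, 0.5, 0, 0, 0)],
    [(0, 0, 0.5, 0.5, 0)], [(0, 0, 0.5, 0.5, 0)],
    [(0, 0, 0, 0.5, 0.5)],
]
arcs, nodes = critical_graph(families)
```

On this input the first pass finds the final classes {0, 1} and {2, 3}; the remaining node loses its only vector in Step 6 because half of its mass points to removed nodes, and the loop stops.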
6 An example with degenerate iterations



In this section, we present an example of a zero-sum two player stochastic game for which we
encounter a degenerate iteration when using the policy iteration algorithm for the mean payoff
problem, showing that Step 5 of Algorithm 1 is essential to obtain the convergence of the algorithm.

Before doing this, let us note that some degenerate cases may not be so problematic. Indeed,
as observed before, the map g of Step 5 of Algorithm 2 is a polyhedral order preserving additively
homogeneous convex map. By [AG03, Theorem 1.1], the set of fixed points of g is isomorphic to a
convex set whose dimension is the number of strongly connected components of the critical graph
of g and which is invariant by the translations by a constant function. In particular, if the number
of strongly connected components of the critical graph is equal to one, then the set of fixed points
of g is exactly equal to the translations of $v'$ by a constant, hence $v^{(k+1)} - v'$ is a constant function.
Since all the maps considered in Algorithm 2 are additively homogeneous, this implies that taking
$v'$ instead of $v^{(k+1)}$, that is applying the same steps as in the nondegenerate case, does not change
the sequence of policies $(\sigma_k)$, and the invariant half lines are just translated by a constant after
this degenerate iteration. Hence, the second part of Step 5 may be avoided in Algorithm 2 when
one encounters only such degenerate iterations. However, to know that g has only one strongly
connected component in its critical graph, one needs to apply the first part of Step 5.

We show now an example for which degenerate iterations occur with two strongly connected
components of the critical graph of g. We shall call these iterations strongly degenerate.

We consider a directed graph, with a set of nodes (or vertices) $[n]$ and a set of arcs $E \subset [n] \times [n]$,
in which each arc $(i, j)$ is equipped with a weight $r_{ij} \in \mathbb{R}$, and consider the map $f$ from $\mathbb{R}^n$ to
itself, defined by:

\[
f_i(v) = \frac{1}{2} \Bigl( \max_{j :\, (i,j) \in E} (r_{ij} + v_j) + \min_{j :\, (i,j) \in E} (r_{ij} + v_j) \Bigr) .
\tag{43}
\]

When the value of $v$ is fixed at some "boundary" points, and the weights are independent of
$j$, the map $f$ arises as the dynamic programming operator of the "tug of war" game [PSSW09],
which can be viewed also as a discretization of the infinity Laplacian operator. Moreover, the case
where all the weights $r_{ij}$ are equal to zero corresponds to a class of auction games, called Richman
games [LLP+99]. Therefore, the above map $f$ appears as the dynamic programming operator of a
variant of these games with additive reward and mean payoff.

We apply the policy iteration algorithm to such a game, with a graph of 5 nodes and complete
set of arcs $E = [5] \times [5]$. Hence, the action spaces $A_i$ and $B_i$ in every state $i \in [n]$ can be identified
with the set $[5]$. The weight $r_{ij}$ of each arc $(i, j) \in E$ is defined as the entry of the following matrix:

\[
r = \begin{pmatrix}
1 & -1 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & -1 & 0 & -1 & 1
\end{pmatrix}
\]

the adjacency graph of which is represented in Figure 1.

Figure 1: Adjacency graph of r.

Let us fix the initial strategy $\sigma_0$ for the first player, such that $\sigma_0(1) = 2$, $\sigma_0(2) = 2$, $\sigma_0(3) = 4$,
$\sigma_0(4) = 4$, $\sigma_0(5) = 2$. Then, the corresponding dynamic programming operator $f^{(\sigma_0)}$ is given by

\[
\begin{aligned}
f_1^{(\sigma_0)}(v) = f_2^{(\sigma_0)}(v) &= \tfrac{1}{2} \bigl( -1 + v_2 + \max(1 + v_1,\, -1 + v_2,\, v_3,\, v_4,\, v_5) \bigr) \\
f_3^{(\sigma_0)}(v) = f_4^{(\sigma_0)}(v) &= \tfrac{1}{2} \bigl( -1 + v_4 + \max(v_1,\, v_2,\, 1 + v_3,\, -1 + v_4,\, v_5) \bigr) \\
f_5^{(\sigma_0)}(v) &= \tfrac{1}{2} \bigl( -1 + v_2 + \max(v_1,\, -1 + v_2,\, v_3,\, -1 + v_4,\, 1 + v_5) \bigr) .
\end{aligned}
\]

In Step 1 of Algorithm 1, we compute an invariant half-line of $f^{(\sigma_0)}$ and obtain for instance
$w^{(0)}(t) = t\eta^{(0)} + v^{(0)}$, with $v^{(0)} = (0, 0, -0.5, -0.5, 0)^T$ and $\eta^{(0)} = (0, 0, 0, 0, 0)^T$. Since $f(w^{(0)}(t)) <
f^{(\sigma_0)}(w^{(0)}(t))$, we need to improve the policy (Step 3) and get the unique solution (even without
the conservative policy): $\sigma_1(1) = 2$, $\sigma_1(2) = 2$, $\sigma_1(3) = 4$, $\sigma_1(4) = 4$, $\sigma_1(5) = 4$. The corresponding
operator is then given by:



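\[
\begin{aligned}
f_i^{(\sigma_1)} &= f_i^{(\sigma_0)}, \quad 1 \le i \le 4, \\
f_5^{(\sigma_1)}(v) &= \tfrac{1}{2} \bigl( -1 + v_4 + \max(v_1,\, -1 + v_2,\, v_3,\, -1 + v_4,\, 1 + v_5) \bigr) .
\end{aligned}
\]

We compute then (in Step 4) an invariant half-line $(\eta^{(1)}, v')$ of $f^{(\sigma_1)}$ and obtain $\eta^{(1)} = (0, 0, 0, 0, 0)^T$
and for instance $v' = (0, 0, 0.5, 0.5, 0.5)^T$. Since $\eta^{(1)} = \eta^{(0)}$, the iteration is degenerate.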

Hence the algorithm enters Step 5. Set $g := f^{(\sigma_1)}$. We have to compute the critical graph
of the map considered in Step 5, which is here equal to $g$, for instance by applying Algorithm 6 to the sets $\mathcal{P}_i$ defined in (40)
with $u = v'$. They are given by $\mathcal{P}_1 = \mathcal{P}_2 = \{(0.5, 0.5, 0, 0, 0)\}$, $\mathcal{P}_3 = \mathcal{P}_4 = \{(0, 0, 0.5, 0.5, 0)\}$,
$\mathcal{P}_5 = \{(0, 0, 0, 0.5, 0.5)\}$, then the critical graph of $g$ is equal to the final graph of $\mathcal{P}_1 \times \cdots \times \mathcal{P}_5$,
which is composed of two strongly connected components with nodes $\{1, 2\}$ and $\{3, 4\}$. Then, $v^{(1)}$
is the unique solution of:

\[
\begin{cases}
v_5 = f_5^{(\sigma_1)}(v) = \tfrac{1}{2} \bigl( -1.5 + \max(0,\, -1,\, -0.5,\, -1.5,\, v_5 + 1) \bigr) \\
v_i = v_i^{(0)}, \quad i \in \{1, 2, 3, 4\} .
\end{cases}
\]

We obtain $v^{(1)} = (0, 0, -0.5, -0.5, -0.5)$ and since $f(w^{(1)}(t)) = f^{(\sigma_1)}(w^{(1)}(t))$, the algorithm stops.

However, if we do not treat the degenerate case by using Step 5 and take for instance $v^{(1)} = v'$,
we obtain $f(w^{(1)}(t)) < f^{(\sigma_1)}(w^{(1)}(t))$, hence we need to improve the strategy, and obtain the
unique solution $\sigma_2 = \sigma_0$. This means that the algorithm cycles, showing the necessity of Step 5 in
the policy iterations.
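The computations of this example are easy to replay numerically. The following Python sketch is only a check under stated assumptions: the weight matrix `R` is the one implied by the operator formulas of this section, nodes are renumbered from 0, and the helper names `f_sigma` and `improve` are hypothetical. It verifies the fixed points claimed above and shows that keeping $v'$ after the degenerate iteration sends the policy back to $\sigma_0$.

```python
R = [
    [1, -1, 0, 0, 0],
    [1, -1, 0, 0, 0],
    [0, 0, 1, -1, 0],
    [0, 0, 1, -1, 0],
    [0, -1, 0, -1, 1],
]
N = range(5)


def f_sigma(sigma, v):
    """Operator with the first (minimizing) player's choice fixed to sigma."""
    return [0.5 * (R[i][sigma[i]] + v[sigma[i]] + max(R[i][j] + v[j] for j in N))
            for i in N]


def improve(v):
    """Minimizing choices of the first player at v (lowest index on ties)."""
    return [min(N, key=lambda j: R[i][j] + v[j]) for i in N]


sigma0 = [1, 1, 3, 3, 1]           # sigma_0 = (2, 2, 4, 4, 2) in 1-based numbering
sigma1 = [1, 1, 3, 3, 3]           # sigma_1 = (2, 2, 4, 4, 4)
v0 = [0, 0, -0.5, -0.5, 0]         # relative value for sigma_0 (eta = 0)
v_prime = [0, 0, 0.5, 0.5, 0.5]    # one fixed point of f^(sigma_1)
v1 = [0, 0, -0.5, -0.5, -0.5]      # fixed point selected by Step 5

fixed_ok = (f_sigma(sigma0, v0) == v0
            and f_sigma(sigma1, v_prime) == v_prime
            and f_sigma(sigma1, v1) == v1)
stops_with_step5 = improve(v1) == sigma1         # no further improvement
cycles_without_step5 = improve(v_prime) == sigma0  # back to the initial policy
```

All the quantities involved are multiples of 0.5, so the floating-point comparisons above are exact.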



7 Implementation and numerical results 

The numerical results presented in this section were obtained with a slight modification of the
policy iteration algorithm (Algorithm 2) and of its ingredients of Section 5, all implemented in the
C library PIGAMES, see [Det12] for more information. All the tests of this section were performed
on a single processor: Intel(R) Xeon(R) W3540 - 2.93GHz with 8 GB of RAM.

These slight modifications take into account the fact that (linear or nonlinear) equations may
not be solved exactly (in exact arithmetic) because of the errors generated by floating-point
computations, and also of the possible use of iterative methods instead of exact methods. Let us
explain them briefly. For instance, the stopping criterion in Step 2 of Algorithm 2 can be replaced
by a condition on the residual of the mean payoff, $\hat{f}(\eta^{(k)}) - \eta^{(k)}$, and the residual of the relative
value, $\tilde{f}_{\eta^{(k)}}(v^{(k)}) - \eta^{(k)} - v^{(k)}$. Here, we consider the infinity norm of the residual of the game that
we define as $0.5 \, ( \|\hat{f}(\eta^{(k)}) - \eta^{(k)}\|_\infty + \|\tilde{f}_{\eta^{(k)}}(v^{(k)}) - \eta^{(k)} - v^{(k)}\|_\infty )$, where $\|\cdot\|_\infty$ denotes the
sup-norm. Then, we stop the policy iterations when the infinity norm of the residual of the game is
smaller than a given value $\epsilon_g > 0$ or when the strategies cannot be improved. For the tests of
this section, we took $\epsilon_g = 10^{-12}$. We use the same condition for the stopping criterion of the
internal policy iterations, that is for Step 3 of Algorithm 3 and Step 3 of Algorithm 4. Moreover,
the optimization problems in Step 3 of Algorithm 2 and Step 3 of Algorithms 3 and 4 are solved
up to some precision. This means for instance that in Algorithm 2, one chooses $\sigma_{k+1} \in \mathcal{A}_M$ such
that, for all $i \in [n]$,

\[
\begin{cases}
\tilde{F}_{\eta^{(k)}}(v^{(k)}; i, \sigma_{k+1}(i)) \le \epsilon_v + \min_{a \in A_{i, \eta^{(k)}, \epsilon_\eta}} \tilde{F}_{\eta^{(k)}}(v^{(k)}; i, a) \quad \text{with} \\
\qquad A_{i, \eta, \epsilon} := \{ a \in A_i \mid \hat{F}(\eta; i, a) \le \epsilon + [\hat{f}(\eta)]_i \} \\
\sigma_{k+1}(i) = \sigma_k(i) \ \text{if } \sigma_k(i) \text{ is optimal,}
\end{cases}
\tag{44}
\]

for some given $\epsilon_\eta$ and $\epsilon_v > 0$. Finally, the linear systems in Step 2 of Algorithms 3 and 4 are
solved up to some precision, which may be lower bounded when the matrices of the systems are
ill-conditioned. See the appendix for details about the solution of these linear systems.
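The residual-based stopping test described above can be written compactly. The sketch below is a minimal illustration and not the PIGAMES code: the two operators are passed as Python callables, and the names `game_residual`, `f_hat` and `f_eta` are hypothetical stand-ins for $\hat{f}$ and $\tilde{f}_\eta$.

```python
def sup_norm(x):
    """Infinity norm of a vector given as a list of floats."""
    return max(abs(t) for t in x)


def game_residual(f_hat, f_eta, eta, v):
    """0.5 * (||f_hat(eta) - eta||_inf + ||f_eta(eta, v) - eta - v||_inf)."""
    res_eta = [a - e for a, e in zip(f_hat(eta), eta)]
    res_v = [a - e - x for a, e, x in zip(f_eta(eta, v), eta, v)]
    return 0.5 * (sup_norm(res_eta) + sup_norm(res_v))


EPS_G = 1e-12   # threshold used for the tests of this section

# Toy operators (assumptions for the example only) with known residuals.
eta, v = [1.0, 1.0], [0.0, 2.0]
res = game_residual(
    lambda e: [e[0], e[1] + 0.5],                       # mean payoff residual 0.5
    lambda e, x: [a + b + 0.25 for a, b in zip(e, x)],  # relative value residual 0.25
    eta, v,
)
should_stop = res < EPS_G
```

With these toy operators the residual is 0.5 * (0.5 + 0.25) = 0.375, so the iterations would continue.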



7.1 Variations on tug of war and Richman games 

We now present some numerical experiments on the variant of Richman games defined in Section 6,
constructed on random graphs. As in the previous section, we consider directed graphs, with a set
of nodes equal to $[n]$ and a set of arcs $E \subset [n]^2$. The dynamic programming operator is the map $f$
defined in (43), where the value $r_{ij}$ is the reward of the arc $(i, j) \in E$. In the tests of Figure 2 to
Figure 4, we chose random sparse graphs with a number of nodes $n$ between 1000 and 50000, and
a number of outgoing arcs fixed to ten for each node. The reward of each arc in $E$ has value one or
zero, that is $r_{ij} = 1$ or $0$. The arcs $(i, j) \in E$ and the associated rewards $r_{ij}$ are chosen randomly
(uniformly and independently). We start the experiments with a size of graph (number of nodes)
equal to $n = 1000$, then we increase the size by 1000 until reaching $n = 10000$, after which we increase
the size by 10000 and end with 50000 nodes. For each size that we consider, we made
a sample of 500 tests. The results of the application of the policy iteration (Algorithm 2 with the
above modifications) on those games are presented in Figures 2 to 4 and are commented below.
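A random instance of this kind can be generated along the following lines. This is a sketch under assumptions that the text does not settle: the ten targets of each node are drawn without repetition and loops are allowed, and the names `random_richman_instance` and `apply_f` are hypothetical.

```python
import random


def random_richman_instance(n, out_degree=10, seed=None):
    """arcs[i] maps each target j of node i to its reward r_ij in {0, 1}."""
    rng = random.Random(seed)
    arcs = {}
    for i in range(n):
        targets = rng.sample(range(n), out_degree)   # distinct targets (assumption)
        arcs[i] = {j: rng.randint(0, 1) for j in targets}
    return arcs


def apply_f(arcs, v):
    """One application of the operator (43) on the sparse instance."""
    out = []
    for i in range(len(v)):
        vals = [r + v[j] for j, r in arcs[i].items()]
        out.append(0.5 * (max(vals) + min(vals)))
    return out


instance = random_richman_instance(50, seed=0)
w = apply_f(instance, [0.0] * 50)
```

Since the rewards lie in {0, 1}, applying the operator to the zero vector gives entries in [0, 1], which is a quick sanity check on the data structure.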

Figure 2 gives for each size $n$, and among the sample of 500 tests, the number of tests that
encountered at least one strongly degenerate policy iteration for the first player. Hence, these
games require the treatment of degenerate iterations presented in this paper, that is Step 5 of Algorithm 1 or 2.
Moreover, from the data of Figure 2, we observe that approximately between 10 and 15 percent of
the tests have at least one strongly degenerate policy iteration for the first player.

Figure 2: Tests on a variant of Richman games constructed on random graphs. The histogram
shows for each size (number of nodes), the number of tests having at least one strongly degenerate
policy iteration for the first player, among 500 tests.



In the table below we report the number of strongly degenerate iterations that occur in the
global sample of tests.

Number of strongly degenerate iterations      0      1     2     3     6
Number of tests                            6051    919    28     1     1



We observe that in general there is no more than one or two strongly degenerate policy iterations
for our sample of tests. Note that in this section, a strongly degenerate policy iteration is to be
understood as a strongly degenerate iteration for the first player only, that is for Algorithm 2.

In Figure 3, we draw on the left curves that represent the number of policy iterations for the
first player, that is the number of iterations of Algorithm 2, as a function of the size $n$ of the
graph. The dashed lines on top and bottom are respectively the maximum and minimum value,
over the sample of 500 tests, and the plain line is the average value, all as a function of the size.
We observe that the average number of first player's policy iterations is almost constant as the size
increases. Using the same model of representation, we show on the right of Figure 3 respectively
the maximum, average and minimum values for the total number of policy iterations for the second
player, that is the sum of the numbers of iterations of Algorithm 4 when applied by Algorithm 2,
as a function of the size. We also observe that these values do not vary a lot with the size.

Figure 3: Tests on a variant of Richman games constructed on random graphs. On the left, the
curves from top to bottom represent respectively the maximum, average, minimum number of first
player's policy iterations, among 500 tests, as a function of the number of nodes. On the right, the
curves represent the total number of second player's policy iterations.

In Figure 4, we present on the left the total cpu time (in seconds) needed by the policy iteration
to find the solution of the game. As for the two previous figures, the curves from top to bottom
show respectively the maximum, average and minimum values, over the sample of 500 tests, as a
function of the size of the graphs. Finally, on the right of Figure 4, we give also the average of
the total cpu time (in seconds) needed to solve the game but we separated the tests with strongly
degenerate policy iteration(s), represented by the dashed line, from the non strongly degenerate
ones, represented by the plain curve. We observe that the average cpu time is somewhat greater
for the tests with strongly degenerate iteration(s). This is due to the additional steps needed for
degenerate iterations. Indeed, the cpu time of a degenerate iteration should be approximately
double that of a nondegenerate iteration, and since the number of policy iterations is around
10 in the sample of tests, the average of the total cpu time of tests with (strongly) degenerate
iterations should be approximately 10 percent greater than that of the other tests.
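This estimate can be spelled out as follows, under the simplifying assumptions (made here only for the sake of the computation) that a test runs exactly 10 policy iterations, that each nondegenerate iteration costs the same time $c$, and that a test with degenerate iterations contains exactly one of them, costing $2c$:

```latex
\[
\frac{T_{\text{degenerate}}}{T_{\text{nondegenerate}}}
\approx \frac{9c + 2c}{10c} = \frac{11}{10} = 1.1 ,
\]
% that is, about 10 percent of additional cpu time.
```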




Figure 4: Tests on a variant of Richman games constructed on random graphs. On the left, the 
curves from top to bottom represent respectively the maximum, average and minimum values of 
the total cpu time (in seconds) taken by the policy iteration algorithm, among 500 tests, as a 
function of the number of nodes. On the right, the dashed line represents the average among the 
tests that encounter at least one strongly degenerate policy iteration for the first player, whereas 
the plain line represents the average among the other tests. 

In addition, in Table 1, we give numerical results for ten tests of the variant of Richman game,
constructed on random large graphs with a number of nodes between $10^5$ and $10^6$. We observe
that the numbers of iterations are of the same order as for the previous sample of tests presented
in Figure 3.

Table 1: Numerical results on a variant of Richman game constructed on random large graphs.

Number of   Iterations of   Total number    Strongly degenerate   Infinity norm   CPU time
nodes       first player    of iterations   iterations            of residual     (s)
100000      12              78              1                     1.44e-14        3.24e+02
200000      12              74              0                     7.44e-15        7.90e+02
300000      11              82              0                     1.33e-15        9.38e+02
400000      12              82              1                     8.55e-15        1.42e+03
500000      12              77              1                     2.00e-14        2.16e+03
600000      12              77              0                     8.66e-15        2.61e+03
700000      11              85              0                     3.02e-14        2.61e+03
800000      12              81              1                     4.82e-14        6.79e+03
900000      12              79              1                     1.27e-14        4.17e+03
1000000     12              90              1                     3.33e-15        1.96e+04




7.2 Pursuit games 

We consider now a pursuit evasion game with two players: a pursuer and an evader. The evader
wants to maximize the distance between him and the pursuer and the pursuer has the opposite
objective. See for instance [BFS94, BFS99, LCS08] for a complete description of general pursuit
games. To simplify the model, we consider as state of the game, the distance between the two
players. Then, the state of the game is given by $x = x_P - x_E$ where $x_P$ is the position of the
pursuer and $x_E$ the position of the evader. We also restrict the state $x$ to stay in a unit square
centered in the 0-position, that is $x \in X := [-0.5, 0.5] \times [-0.5, 0.5]$. At each time of the game,
the reward for the evader is the square of the euclidean norm of the distance between the two players,
i.e. $\|x\|_2^2$. Such a game is a special class of differential game, the dynamic programming equation
of which is an Isaacs partial differential equation. Under our simplifications and assumptions, the
Hamiltonian of this equation is given by:

\[
H(x, p) = \max_{a \in A(x)} (a \cdot p) + \min_{b \in B(x)} (b \cdot p) + \|x\|_2^2
\qquad \forall x \in X,\ p \in \mathbb{R}^2 ,
\tag{45}
\]

meaning that in the case of a finite horizon problem, the Isaacs equation would be given, at least
formally (but also in the viscosity sense) by:

\[
-\frac{\partial v}{\partial t} + H(x, \nabla v(x)) = 0, \qquad x \in X .
\]

Here $A(x)$ and $B(x)$ are the sets of possible directions for the evader and the pursuer respectively,
when the state is equal to $x \in X$. On the boundary, we consider that only actions keeping the
state of the game in the domain $X$ are allowed, hence the above equation has to be satisfied up to
the boundary.

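We shall consider this differential game with a mean-payoff criterion and the above reward.
This means that the analogue of System (14) is the following system of Isaacs equations:

\[
\begin{cases}
\max_{a \in A(x)} (a \cdot \nabla \eta(x)) + \min_{b \in B(x)} (b \cdot \nabla \eta(x)) = 0, & x \in X \\
-\eta(x) + \max_{a \in A_\eta(x)} (a \cdot \nabla v(x)) + \min_{b \in B_\eta(x)} (b \cdot \nabla v(x)) + \|x\|_2^2 = 0, & x \in X
\end{cases}
\tag{46}
\]

where

\[
A_\eta(x) := \operatorname*{argmax}_{a \in A(x)} (a \cdot \nabla \eta(x)),
\qquad
B_\eta(x) := \operatorname*{argmin}_{b \in B(x)} (b \cdot \nabla \eta(x)) .
\]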
In classical pursuit-evasion games, such as in [BFS99], the reward is constant and the value function
is defined as the time (or the exponential of the opposite of the time) for the pursuer to capture
the evader, then the value function is solution of the stationary Isaacs equation that is (46) with
$\eta = 0$, corresponding to the above Hamiltonian with 1 instead of $\|x\|_2^2$. In that case, the value is
infinite when the pursuer's speed is smaller than the evader's speed, and it would be difficult to
compute an optimal strategy using Isaacs equation. Here by considering a mean-payoff problem,
we may solve the problem even when the pursuer's speed is smaller than the evader's speed, as we
shall see below. Note that one may have kept the reward equal to 1, but then the optimal value $\eta$
would have given less information.

A monotone discretization, for instance a finite difference discretization scheme (see [KD92]),
of System (46) yields System (14) for the dynamic programming operator $f$ of a discrete time
and finite state space game, which then may be solved using our policy iteration Algorithm 2.

In our tests, the domain $X$ is discretized in each direction with a constant step size $h$. Then
the two players of the discrete game are moving on the discretized nodes of the domain, similarly to
the moves in a chess game. We assume also that the evader cannot move when the euclidean norm
of the relative distance between him and the pursuer is less than 0.1, i.e. when $x \in B((0, 0); 0.1)$.
We shall call the evader the mouse, and his set of possible actions at each state of the game will
be given by:

\[
A(x) := \begin{cases}
\{ (a_1, a_2) \mid a_i \in \{0, 1, -1\},\ i = 1, 2 \} & x \in \mathring{X} \setminus B((0, 0); 0.1) \\
\{ (0, 0) \} & x \in B((0, 0); 0.1) ,
\end{cases}
\]

where $\mathring{X}$ denotes the interior of $X$. The pursuer, that we shall call the cat, has the following set
of possible actions:

\[
B(x) := \{ (b_1, b_2) \mid b_i \in \{0, b, -b\},\ i = 1, 2 \}, \qquad x \in \mathring{X} ,
\]

where $b$ is a positive real constant and represents the speed of the cat. Moreover, on the boundary
of $X$, the sets $A(x)$ and $B(x)$ are restricted to avoid actions that bring the state out of $X$.
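The two action sets at interior states are small enough to enumerate directly. The Python sketch below covers interior states only and deliberately leaves out the boundary restriction, whose precise form is not given here; it also treats the capture ball as open, following the phrase "less than 0.1", which is an assumption. The names `mouse_actions`, `cat_actions` and `CAPTURE_RADIUS` are hypothetical.

```python
import math
from itertools import product

CAPTURE_RADIUS = 0.1   # radius of the ball B((0, 0); 0.1)


def mouse_actions(x):
    """Evader (mouse) directions A(x) at an interior state x = (x1, x2)."""
    if math.hypot(x[0], x[1]) < CAPTURE_RADIUS:
        return [(0, 0)]                      # the mouse cannot move near the cat
    return list(product((0, 1, -1), repeat=2))


def cat_actions(b):
    """Pursuer (cat) directions B(x) at an interior state, for a cat of speed b."""
    return list(product((0, b, -b), repeat=2))


far = mouse_actions((0.3, 0.3))      # nine directions, including staying put
near = mouse_actions((0.05, 0.0))    # only (0, 0)
cat = cat_actions(0.999)             # nine directions scaled by b
```

Each player therefore has nine candidate moves per interior node, in the manner of a king's moves in chess, with the cat's moves scaled by its speed.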

Numerical results for this game are presented in Table 2 when $b = 0.999$, $b = 1$ and $b = 1.001$
respectively. Note that the solution of the discretization of Equation (46) may differ from the
solution of the continuous equation. We observe that for $b = 0.999$ and $b = 1.001$, we have a
strongly degenerate iteration for the first player on the last iteration.

The optimal actions for the discretized problem with $b = 0.999$ are represented in Figure 5, at
each node of the grid: the actions of the mouse are on the left, and those of the cat are on the right.
The optimal actions are approximately the same for the two other values of $b$. When $b = 0.999$,
the speed of the cat is smaller than the speed of the mouse (= 1). The numerical results for the
discretized game give an optimal mean-payoff $\eta$ such that $\eta(x) = 0.492$ for $x \in X \setminus B((0, 0); 0.1)$
and $\eta(x) = 0$ for $x \in B((0, 0); 0.1)$. This means that the cat cannot catch the mouse when their
starting positions are not too close and the mouse can keep almost the maximum distance between
them. The relative value is represented on the left of Figure 6. When $b = 1$, the speeds of the
cat and the mouse are equal. The numerical results for the discretized game give a relative value
$v$ approximately equal to zero for every starting point and an optimal mean-payoff $\eta(x) \approx \|x\|_2^2$,
meaning that the cat and mouse keep the same initial distance all along the game. In the last
example, the speed of the cat $b = 1.001$ is greater than that of the mouse (= 1). The numerical
results for the discretized game give an optimal mean-payoff $\eta$ close to zero. The relative value $v$
is given on the right of Figure 6. In this case, the cat catches the mouse.

Table 2: Numerical results for the mouse and cat example, where b is the speed of the cat. The
second column is the index of the iteration on the cat's policies and the third column is the
corresponding total number of iterations on the mouse's policies. The last column indicates whether
the cat's policy iteration is strongly degenerate. Number of discretization nodes: 257 x 257.

Figure 5: Optimal actions for the mouse on the left and for the cat on the right.

Figure 6: Relative value v for the mouse and cat game when the speed of the mouse is one, the 
speed of the cat equals 0.999 on the left and 1.001 on the right. 

A Details of implementation of Policy Iteration for multichain one player games



We explain here in more detail why System (33) has a unique solution and selects one particular
solution of System (37) with k instead of k + 1, and how it is solved in practice (see also [DF68,
Put94]).



Recall that System (37) is of the form (39), rewritten here:

$$\begin{cases} \eta = P\eta, \\ \eta + v = r + Pv, \end{cases}$$

where $r = r^{(\delta)} \in \mathbb{R}^n$ and $P = P^{(\delta)}$ is a stochastic matrix, with $\delta = \delta_{k+1}$. Moreover, System (33)
corresponds to




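$$\begin{cases} \eta_i = (P\eta)_i, & i \in [n] \setminus S, \\ \eta_i + v_i = r_i + (Pv)_i, & i \in [n], \\ v_i = 0, & i \in S, \end{cases} \tag{47}$$

where $r$ and $P$ are as before but with $\delta = \delta_k$, and where $S$ is composed of the minimal indices $i_F$ of
each final class $F$ of $P$. One then needs to show that System (47) has a unique solution and
selects one particular solution of System (39).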

First, for all final classes $F$ of $P$, $P_{FF}$ is an irreducible Markov matrix, hence the equation
$\eta_F = (P\eta)_F$, which is equivalent to $\eta_F = P_{FF}\eta_F$, is also equivalent to the condition that $\eta_i = \eta_j$
for all $i, j \in F$. Moreover, $P_{FF}$ has a unique stationary (or invariant) probability measure $\pi_F$,
that is, a row probability vector solution of $\pi_F = \pi_F P_{FF}$, and this vector has strictly positive
coordinates. Since $\pi_F(\eta_F - P_{FF}\eta_F) = 0$ for every $\eta_F$, each of the equations indexed by $F$ is a linear
combination of the others. This implies that one can eliminate, for each final class $F$ of $P$, one
equation with index $i$ in $F$ in the equation $\eta = P\eta$, without changing the set of solutions. Hence,
any solution of System (47) is also a solution of (39).
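
The existence and positivity of the stationary probability of an irreducible block can be observed on a small example. The following Python sketch approximates it by power iteration on the lazy chain $(I + P_{FF})/2$; this is a generic technique chosen for brevity, not a description of the solver used in the experiments, and the function name `stationary` and the 2 x 2 block are assumptions made for the illustration.

```python
def stationary(P, iters=500):
    """Approximate the stationary row vector pi = pi P of an irreducible
    stochastic matrix P by iterating pi <- pi (I + P) / 2."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        # Component j of pi (I + P) / 2; averaging with the identity
        # removes possible periodicity without changing the fixed point.
        pi = [0.5 * (pi[j] + sum(pi[i] * P[i][j] for i in range(n)))
              for j in range(n)]
    return pi

# Hypothetical final class with two states.
P_FF = [[0.5, 0.5],
        [0.25, 0.75]]
pi_F = stationary(P_FF)   # close to (1/3, 2/3), with positive coordinates
```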



Second, denote by $\mathcal{F}$ the union of final classes, and by $\mathcal{T}$ the union of transient classes, that
is, the complement of $\mathcal{F}$ in $[n]$. Then, $w$ is in the kernel of $I - P$ if, and only if, it satisfies
$w_F = P_{FF} w_F$ for all final classes $F$ of $P$, and $w_{\mathcal{T}} = P_{\mathcal{TT}} w_{\mathcal{T}} + P_{\mathcal{TF}} w_{\mathcal{F}}$. As said before, the first
equations are equivalent to the conditions $w_i = w_j$ for all $i, j \in F$. Since $P_{\mathcal{TT}}$ has transient classes
only, it has a spectral radius strictly less than one, which implies that, given the vectors $w_F$ for
all final classes $F$, the equation $w_{\mathcal{T}} = P_{\mathcal{TT}} w_{\mathcal{T}} + P_{\mathcal{TF}} w_{\mathcal{F}}$ has a unique solution $w_{\mathcal{T}}$. Hence, the
dimension of the kernel of $I - P$ is equal to the number of final classes of $P$, and any element of
this kernel which has one coordinate $i \in F$ equal to zero for each final class $F$ of $P$ has all its
coordinates equal to zero. This implies that, given $\eta \in \mathbb{R}^n$, the solution $v$ of System (47) is unique
if it exists (the difference between two such solutions satisfies the above conditions). This also
implies that the codimension of the image of $I - P$ is equal to the number of final classes. Hence,
the image of $I - P$ is exactly equal to the set of vectors $y \in \mathbb{R}^n$ such that $\pi_F y_F = 0$ for all final
classes $F$ of $P$ (the row vector equal to $\pi_F$ on $F$ and to zero elsewhere is a left fixed vector of $P$, so
the image is contained in this set, and both subspaces have the same codimension). A vector
$\eta \in \mathbb{R}^n$ is such that System (47) has a solution $v \in \mathbb{R}^n$ if and only if
$\eta = P\eta$ and $\eta - r$ is in the image of $I - P$. These conditions are equivalent to the three conditions
$\eta_i = \eta_j$ for all $i, j \in F$ and all final classes $F$, $\eta_{\mathcal{T}} = P_{\mathcal{TT}} \eta_{\mathcal{T}} + P_{\mathcal{TF}} \eta_{\mathcal{F}}$, and $\pi_F \eta_F = \pi_F r_F$ for all
final classes $F$. The first and third conditions together are equivalent to $\eta_i = \pi_F r_F$ for all $i \in F$
and all final classes $F$ of $P$, which gives a unique solution $\eta_{\mathcal{F}}$. Since the second one has a unique
solution $\eta_{\mathcal{T}}$, given $\eta_{\mathcal{F}}$, we get that there is a unique vector $\eta \in \mathbb{R}^n$ such that System (47) has a
solution $v \in \mathbb{R}^n$. In conclusion, System (47) has a unique solution $(\eta, v)$, which finishes the proof
of what we wanted to show.
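
The role of the normalization on $S$ can be checked on a small example. The sketch below, written in exact rational arithmetic, verifies that a pair $(\eta, v)$ solves System (39), and that shifting $v$ by an element of the kernel of $I - P$ still solves (39) but violates the conditions $v_i = 0$, $i \in S$, of System (47). It uses the fact, shown in the first step above, that System (47) amounts to System (39) together with these conditions. The 4-state chain, the rewards and the helper names are hypothetical, and states are numbered from 0.

```python
from fractions import Fraction as Fr

def matvec(P, x):
    """Matrix-vector product P x."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

def solves_39(P, r, eta, v):
    """Check eta = P eta and eta + v = r + P v exactly."""
    Pv = matvec(P, v)
    return matvec(P, eta) == eta and all(
        e + vi == ri + pv for e, vi, ri, pv in zip(eta, v, r, Pv))

def solves_47(P, r, eta, v, S):
    """System (39) plus the normalization v_i = 0 for i in S."""
    return solves_39(P, r, eta, v) and all(v[i] == 0 for i in S)

# Hypothetical chain: final classes {0} and {1, 2}, transient state 3.
P = [[Fr(1), 0, 0, 0],
     [0, Fr(1, 2), Fr(1, 2), 0],
     [0, Fr(1, 4), Fr(3, 4), 0],
     [Fr(1, 2), 0, Fr(1, 4), Fr(1, 4)]]
r = [1, 0, 3, 2]
S = [0, 1]                                   # minimal index of each final class
eta = [Fr(1), Fr(2), Fr(2), Fr(4, 3)]
v = [Fr(0), Fr(0), Fr(4), Fr(20, 9)]

w = [Fr(3), Fr(0), Fr(0), Fr(2)]             # satisfies w = P w
v_shifted = [a + b for a, b in zip(v, w)]    # solves (39) but not (47)
```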



One may try to solve System (47) by using usual LU methods; however, when $P$ is not irreducible,
such a method is not robust. We rather use the decomposition of $P$ into classes, and the previous
properties. In particular, since $P_{\mathcal{TT}}$ has a spectral radius strictly less than 1, one can compute
the solution $(\eta, v)$ of System (47) by first computing $\eta_F$ and $v_F$ for all final classes $F$ of $P$, then
computing successively $\eta_{\mathcal{T}}$ and $v_{\mathcal{T}}$, which are respectively fixed points of contracting affine systems
with tangent linear operator $P_{\mathcal{TT}}$:

$$\eta_{\mathcal{T}} = P_{\mathcal{TT}} \eta_{\mathcal{T}} + P_{\mathcal{TF}} \eta_{\mathcal{F}},$$
$$v_{\mathcal{T}} = P_{\mathcal{TT}} v_{\mathcal{T}} + P_{\mathcal{TF}} v_{\mathcal{F}} + r_{\mathcal{T}} - \eta_{\mathcal{T}}.$$
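
Because the spectral radius of $P_{\mathcal{TT}}$ is less than one, both affine systems can in principle be solved by plain fixed-point iteration; a direct solver is equally possible. The following is a minimal sketch of the iterative variant, assuming a single transient state and given values of $\eta$ and $v$ on the final classes. All numbers and names are illustrative.

```python
def dot(row, x):
    return sum(a * b for a, b in zip(row, x))

def affine_fixed_point(A, c, iters=200):
    """Iterate x <- A x + c from x = 0; this converges when the
    spectral radius of A is strictly less than one."""
    x = [0.0] * len(c)
    for _ in range(iters):
        x = [dot(row, x) + ci for row, ci in zip(A, c)]
    return x

# Hypothetical data: one transient state, three states in final classes.
P_TT = [[0.25]]
P_TF = [[0.5, 0.0, 0.25]]
eta_F = [1.0, 2.0, 2.0]
v_F = [0.0, 0.0, 4.0]
r_T = [2.0]

# eta_T = P_TT eta_T + P_TF eta_F
eta_T = affine_fixed_point(P_TT, [dot(row, eta_F) for row in P_TF])
# v_T = P_TT v_T + P_TF v_F + r_T - eta_T
v_T = affine_fixed_point(
    P_TT, [dot(row, v_F) + rt - et for row, rt, et in zip(P_TF, r_T, eta_T)])
```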

There exist two ways to compute $\eta_F$ and $v_F$. One is to compute the stationary probability $\pi_F$
to determine $\eta_F$ by $\eta_i = \pi_F r_F$, for all $i \in F$, and then solve the following system with unknown
$v_F \in \mathbb{R}^F$:

$$v_F = P_{FF} v_F + r_F - \eta_F,$$

by eliminating one equation (since one equation is redundant) with index $j \in F$, and adding the
condition $v_i = 0$ for one element $i \in F$. Another method is to consider $\eta_F$ as constant, say $\eta_i = \bar\eta$
for $i \in F$, and solve the system with unknowns $\bar\eta \in \mathbb{R}$ and $v \in \mathbb{R}^F$:

$$\bar\eta + v_i = r_i + \sum_{j \in F} P_{ij} v_j, \qquad i \in F,$$

by adding the condition $v_i = 0$ for one element $i \in F$. In our algorithm, we choose the index
$i = j \in F$ to be the minimal index of $F$ (for a fixed total ordering of nodes). This method gives
the following algorithm to solve System (47).



Algorithm 7 (Solution of System (47)). Decompose the matrix $P$ into irreducible classes and
permute nodes without changing the order in each class, such that $P$ takes the following form:

$$P = \begin{pmatrix}
P_{11} & P_{12} & \cdots & \cdots & P_{1m} \\
0 & P_{22} & \cdots & \cdots & P_{2m} \\
\vdots & \ddots & \ddots & & \vdots \\
0 & \cdots & 0 & P_{m-1,m-1} & P_{m-1,m} \\
0 & \cdots & \cdots & 0 & P_{mm}
\end{pmatrix},$$

where $m$ denotes the number of irreducible classes and $P_{II}$ are square irreducible submatrices of
$P$, for $I = 1, \dots, m$. Note that the class corresponding to a submatrix $P_{II}$ is final if and only if
the submatrices $P_{IJ}$ are null for all $J \neq I$.
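For each class $I$ from $m$ to $1$, do the following:

Step 1. If $I$ corresponds to a final class, that is, if $P_{II}$ is a stochastic matrix, do one of the two
following sequences of operations:

A. (a) Find the stationary probability $\pi_I$ of $P_{II}$: $\pi_I P_{II} = \pi_I$,

(b) Set $\bar\eta = \pi_I r_I$ and $\eta_i = \bar\eta$, $i \in I$,

(c) Solve the system with unknown $v_I \in \mathbb{R}^I$:

$$\begin{cases} v_i = \sum_{j \in I} P_{ij} v_j + r_i - \bar\eta, & i \in I \setminus S, \\ v_i = 0, & i \in S \cap I. \end{cases}$$

B. Solve the system with unknowns $v_I \in \mathbb{R}^I$ and $\bar\eta \in \mathbb{R}$:

$$\begin{cases} \bar\eta + v_i = r_i + \sum_{j \in I} P_{ij} v_j, & i \in I, \\ v_i = 0, & i \in S \cap I, \end{cases} \tag{48}$$

and set $\eta_i = \bar\eta$, $i \in I$.

Step 2. If $I$ corresponds to a transient class, that is, if $P_{II}$ is a strictly submarkovian matrix, do
the following steps:

(a) compute $\eta_I$ solution of the following system:

$$\eta_I = P_{II} \eta_I + \sum_{J > I} P_{IJ} \eta_J,$$

(b) compute $v_I$ solution of the following system:

$$v_I = P_{II} v_I + \sum_{J > I} P_{IJ} v_J + r_I - \eta_I.$$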
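
The steps of Algorithm 7 can be sketched in a few lines of Python. The version below is only an illustration under simplifying assumptions: it uses exact rational arithmetic and a dense Gauss-Jordan elimination instead of the solvers used in the experiments, it finds the classes by a transitive closure, and it applies method B on final classes. The function names, the ordering trick (classes sorted by the number of states they can reach) and the 4-state example are assumptions made here; states are numbered from 0.

```python
from fractions import Fraction as Fr

def solve(A, b):
    """Gauss-Jordan elimination over exact rationals (A assumed invertible)."""
    n = len(A)
    M = [[Fr(x) for x in row] + [Fr(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = next(i for i in range(c, n) if M[i][c] != 0)   # pivot row
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [x / piv for x in M[c]]
        for i in range(n):
            if i != c and M[i][c] != 0:
                f = M[i][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[c])]
    return [row[n] for row in M]

def classes_downstream_first(P):
    """Irreducible classes of P, each listed in increasing index order and
    sorted so that a class comes after every class it can reach."""
    n = len(P)
    reach = [[i == j or P[i][j] != 0 for j in range(n)] for i in range(n)]
    for k in range(n):                         # Warshall transitive closure
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    classes = {}
    for i in range(n):
        key = min(j for j in range(n) if reach[i][j] and reach[j][i])
        classes.setdefault(key, []).append(i)
    # A class that reaches another one reaches strictly more states.
    return sorted(classes.values(), key=lambda I: sum(reach[I[0]]))

def solve_system_47(P, r):
    n = len(P)
    eta, v = [Fr(0)] * n, [Fr(0)] * n
    for I in classes_downstream_first(P):
        out = [j for j in range(n) if j not in I]
        if all(P[i][j] == 0 for i in I for j in out):       # Step 1: final class
            rest = I[1:]                # v is set to 0 at the minimal index I[0]
            # Method B: unknowns (eta_bar, v_j for j in rest).
            A = [[1] + [(i == j) - P[i][j] for j in rest] for i in I]
            sol = solve(A, [r[i] for i in I])
            for i in I:
                eta[i] = sol[0]
            for j, vj in zip(rest, sol[1:]):
                v[j] = vj
        else:                                               # Step 2: transient class
            A = [[(i == j) - P[i][j] for j in I] for i in I]
            c_eta = [sum(P[i][j] * eta[j] for j in out) for i in I]
            for i, x in zip(I, solve(A, c_eta)):
                eta[i] = x
            c_v = [sum(P[i][j] * v[j] for j in out) + r[i] - eta[i] for i in I]
            for i, x in zip(I, solve(A, c_v)):
                v[i] = x
    return eta, v

# Hypothetical chain: final classes {0} and {1, 2}, transient state 3.
P = [[Fr(1), 0, 0, 0],
     [0, Fr(1, 2), Fr(1, 2), 0],
     [0, Fr(1, 4), Fr(3, 4), 0],
     [Fr(1, 2), 0, Fr(1, 4), Fr(1, 4)]]
r = [1, 0, 3, 2]
eta, v = solve_system_47(P, r)
```

On this example the sketch returns $\eta = (1, 2, 2, 4/3)$ and $v = (0, 0, 4, 20/9)$, which can be checked by hand against System (47) with $S = \{0, 1\}$.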

In our numerical experiments, the linear system (33) at each internal policy iteration is solved
by using Algorithm 7. For the numerical experiments of Section 7.1, on each final class, we used a
SOR iterative solver to find the stationary probability $\pi_I$ and also to compute the corresponding
$v_I$ in method A in Step 1 of Algorithm 7. For the transient classes, we used the LU solver of the
package [DEG+99].

The Successive Over-Relaxation (SOR) method is an iterative scheme that belongs to the class
of splitting methods or relaxation methods, see for instance [BP94]. It is derived from the
Gauss-Seidel relaxation scheme. Consider a matrix $A \in \mathbb{R}^{n \times n}$ such that $A = D - L - U$, where $D$, $-L$, $-U$
are respectively the diagonal, lower and upper triangular parts of $A$. The SOR smoothing operator
is defined by $S_\omega = M^{-1} N$, where $M = D - \omega L$ and $N = (1 - \omega) D + \omega U$, for $0 < \omega < 2$.

Consider the irreducible stochastic matrix $P_{II} \in \mathbb{R}^{I \times I}$ and decompose $\mathcal{I} - P_{II} = D - L - U$,
where $\mathcal{I}$ is the identity matrix of $\mathbb{R}^{I \times I}$. Starting from an initial positive approximation $\pi^{(0)} \in \mathbb{R}^I$,
a SOR smoothing step to find the stationary probability of $P_{II}$ is given by

$$w = S_\omega \pi^{(k)}, \qquad \pi^{(k+1)} = \frac{w}{\sum_{i \in I} w_i}.$$

The sequence $(\pi^{(k)})_{k \geq 0}$ converges to the transpose of the stationary probability of $P_{II}$ when
the limit $\lim_{k \to \infty} S_\omega^k$ exists, see [BP94] for more details. To solve Equation (48), decompose
$\mathcal{I} - P_{II} = D - L - U$. Then, starting from an initial approximation $v^{(0)} \in \mathbb{R}^I$, a SOR smoothing
step consists in:

$$v^{(k+1)} = (\mathcal{I} - \mathbb{1}\mu) \left( S_\omega v^{(k)} + \omega M^{-1} (r_I - \eta_I) \right),$$

where $\mathbb{1} = (1, \dots, 1)^T$ and $\mu \in \mathbb{R}^I$ is a row vector such that $\mu_i = 1$ for $i \in S \cap I$, and
$\mu_i = 0$ otherwise. The sequence $(v^{(k)})_{k \geq 0}$ converges to the solution of Equation (48) when the
limit $\lim_{k \to \infty} S_\omega^k$ exists, see [BP94] for more details.
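
A componentwise form of the SOR step for the relative value on a final class can be sketched as follows. This is a minimal illustration, assuming that $\eta_I$ is already known (method A), that the diagonal entries of $P_{II}$ are less than one, and that the normalized index is given by its local position `i0`; the function name, the relaxation parameter and the 2 x 2 data are hypothetical.

```python
def sor_relative_value(P, rhs, i0=0, omega=0.9, sweeps=200):
    """SOR sweeps for the singular system (I - P) v = rhs, followed after
    each sweep by the normalization v[i0] = 0 (subtraction of a constant)."""
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Row i of A = I - P has diagonal 1 - P[i][i] and off-diagonal
            # entries -P[i][j]; entries j < i already hold updated values.
            off = sum(P[i][j] * v[j] for j in range(n) if j != i)
            v[i] = (1 - omega) * v[i] + omega * (rhs[i] + off) / (1 - P[i][i])
        shift = v[i0]
        v = [x - shift for x in v]
    return v

# Hypothetical final class with rewards r = (0, 3) and mean payoff eta = 2.
P_II = [[0.5, 0.5],
        [0.25, 0.75]]
rhs = [0.0 - 2.0, 3.0 - 2.0]          # r_I - eta_I
v_I = sor_relative_value(P_II, rhs)   # close to (0, 4)
```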

For the numerical tests of Section 7.2, we used the LU solver of the package [DEG+99] in both
cases.






